andre/30-seconds-of-code

Stefan Fejes cc8f1d8a7a WIP - add extractor, generate snippet_data

2019-08-20 15:52:05 +02:00

parse-latin

A Latin script language parser for retext producing NLCST nodes.

Whether it is Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), parse-latin does a good job of tokenising it.

parse-latin also does a decent job of tokenising Latin-like scripts, such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian (“Շատ հաճելի է”).

Installation

npm:

npm install parse-latin

Usage

var inspect = require('unist-util-inspect')
var Latin = require('parse-latin')

var tree = new Latin().parse('A simple sentence.')

console.log(inspect(tree))

Which, when inspecting, yields:

RootNode[1] (1:1-1:19, 0-18)
└─ ParagraphNode[1] (1:1-1:19, 0-18)
   └─ SentenceNode[6] (1:1-1:19, 0-18)
      ├─ WordNode[1] (1:1-1:2, 0-1)
      │  └─ TextNode: "A" (1:1-1:2, 0-1)
      ├─ WhiteSpaceNode: " " (1:2-1:3, 1-2)
      ├─ WordNode[1] (1:3-1:9, 2-8)
      │  └─ TextNode: "simple" (1:3-1:9, 2-8)
      ├─ WhiteSpaceNode: " " (1:9-1:10, 8-9)
      ├─ WordNode[1] (1:10-1:18, 9-17)
      │  └─ TextNode: "sentence" (1:10-1:18, 9-17)
      └─ PunctuationNode: "." (1:18-1:19, 17-18)

API

`ParseLatin(value)`

Exposes the functionality needed to tokenise natural Latin-script languages into a syntax tree. If value is passed here, it does not need to be passed to #parse().

`ParseLatin#tokenize(value)`

Tokenise value (string) into letters and numbers (words), white space, and everything else (punctuation). The returned nodes are a flat list without paragraphs or sentences.

Returns

Array.<NLCSTNode> — Nodes.

`ParseLatin#parse(value)`

Tokenise value (string) into an NLCST tree. The returned node is a RootNode containing paragraphs and sentences.

Returns

NLCSTNode — Root node.
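As a rough illustration of working with the returned tree, the sketch below walks a hand-written object that mirrors the node shapes in the usage output above (positions omitted). It assumes parent nodes carry `children` and text-bearing nodes carry `value`; the helper names `toText` and `collect` are hypothetical and not part of parse-latin.

```javascript
// Hand-written stand-in for the tree of 'A simple sentence.' (an assumption, not library output).
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the text of a node and all of its descendants.
function toText(node) {
  return node.children ? node.children.map(toText).join('') : node.value
}

// Hypothetical helper: collect the text of every node of the given type.
function collect(node, type, found = []) {
  if (node.type === type) {
    found.push(toText(node))
  } else if (node.children) {
    node.children.forEach(child => collect(child, type, found))
  }
  return found
}

console.log(toText(tree)) // 'A simple sentence.'
console.log(collect(tree, 'WordNode')) // [ 'A', 'simple', 'sentence' ]
```

Because every character of the input ends up in some leaf node, joining the leaves in order gives back the original text.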

Algorithm

Note: The easiest way to see how parse-latin tokenises and parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.

parse-latin splits text into white space, word, and punctuation tokens. It starts out with a simple definition, one that most other tokenisers use:

A “word” is one or more letter or number characters
A “white space” is one or more white space characters
A “punctuation” is one or more of anything else
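The three definitions above can be sketched with a single regular expression. This is a minimal illustration, not parse-latin's actual implementation: the function name `naiveTokenise`, the token shape, and the use of Unicode property escapes (`\p{L}` for letters, `\p{N}` for numbers) are all assumptions made for the example.

```javascript
// One alternative per definition: word, white space, anything else.
// Assumption: "letter or number" is approximated by \p{L} and \p{N}.
const TOKEN = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu

// Hypothetical helper: split a string into a flat list of {type, value} tokens.
function naiveTokenise(value) {
  const tokens = []
  for (const match of value.matchAll(TOKEN)) {
    let type = 'punctuation'
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whiteSpace'
    tokens.push({type, value: match[0]})
  }
  return tokens
}

console.log(naiveTokenise('A simple sentence.').map(token => token.type))
// [ 'word', 'whiteSpace', 'word', 'whiteSpace', 'word', 'punctuation' ]

console.log(naiveTokenise('she’s').map(token => token.value))
// [ 'she', '’', 's' ]
```

The second call shows the limit of this definition: the apostrophe splits “she’s” into three tokens.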

Then, it manipulates and merges those tokens into an NLCST syntax tree, adding sentences and paragraphs where needed. This step has to account for cases such as:

Some punctuation marks are part of the word they occur in, e.g., non-profit, she’s, G.I., 11:00, N/A, &c, nineteenth- and...
Some full stops do not mark a sentence end, e.g., “1.”, “e.g.”, “id.”
Although full stops, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, e.g., “.)” or “."”
And many more exceptions.
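To make the first exception concrete, here is a minimal sketch of one possible merge pass over a flat token list. It is not how parse-latin is implemented: the token shape (`{type, value}`), the function name `mergeInnerWordPunctuation`, and the small set of marks treated as word-internal are assumptions chosen for illustration.

```javascript
// Assumed set of marks that may sit inside a word; the real list of exceptions is longer.
const INNER_WORD = new Set(['-', '’', "'", ':', '/'])

// Hypothetical pass: fold "word + mark + word" (with no white space between) into one word.
function mergeInnerWordPunctuation(tokens) {
  const result = []
  let index = 0
  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]
    if (
      token.type === 'punctuation' &&
      INNER_WORD.has(token.value) &&
      previous && previous.type === 'word' &&
      next && next.type === 'word'
    ) {
      // Extend the previous word and skip both the mark and the following word.
      previous.value += token.value + next.value
      index += 2
    } else {
      // Copy the token so the input list is left untouched.
      result.push({...token})
      index++
    }
  }
  return result
}

const tokens = [
  {type: 'word', value: 'non'},
  {type: 'punctuation', value: '-'},
  {type: 'word', value: 'profit'},
  {type: 'whiteSpace', value: ' '},
  {type: 'word', value: 'N'},
  {type: 'punctuation', value: '/'},
  {type: 'word', value: 'A'},
  {type: 'punctuation', value: '.'}
]

console.log(mergeInnerWordPunctuation(tokens).map(token => token.value))
// [ 'non-profit', ' ', 'N/A', '.' ]
```

The trailing full stop stays a separate punctuation token here; deciding whether it ends a sentence or belongs to an abbreviation needs further rules of the kind listed above.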
