page Tutorial: Dissecting input

The problem

Input is usually written as a sequence of characters from which the terminal symbols have to be extracted before the input can be matched to a grammar. page takes the view that literals represent themselves and that each token has a pattern to recognize its representations in the input.

1, 23, or other digit sequences all would represent a Number if the token is defined as

%token Number '[0-9]+';

If there is another token

%token Name '[a-zA-Z][-a-zA-Z0-9_]*';

sequences like Name or more-rules would represent a Name. A regular expression recognizes the longest possible character sequence; therefore, catch22 represents a single Name, not a Name followed by the Number 22.

A literal such as 'if' would appear as if in the input and cannot represent a Name. But the longer sequence iffy represents a single Name, not if followed by the Name fy.

`Tokenizer`

Given a page.BNF object representing a grammar, page can create a Tokenizer, i.e., a function which accepts an input string and extracts a list of terminal symbols. We can try this with the BNF grammar object returned by page.BNF.ctor:

$ jjs -scripting page.min.js -
jjs> t = page.BNF.ctor().tokenizer()
function tokenizer (_input) {
  ...
}

This particular function is created from the description of BNF. Therefore it will extract the terminal symbols which are used to describe BNF, for example:

jjs> t(' %token: + %tokens ').join('\n')
(1) '%token'
(1) ':'
(1) $error '+'
(1) '%token'
(1) Name 's'

It returns an array with literals such as %token or a colon, and a Name represented as the letter s immediately following the second %token. Illegal input characters, if any, are reported as the special token $error. White space is discarded.

The result of a Tokenizer call is an array of page.Tuple elements. Each element contains the line number where the symbol was found, together with the classification — literal, token, or $error — and the string representing the symbol.

The implementation

For most grammars, the token patterns are relatively simple, and the literals are words like if and one- or two-letter operators such as parentheses, +, and <=. Therefore, most of the time input can be dissected using a pattern composed of the patterns for the tokens, a pattern enumerating all literals, longer literals first, and a pattern for input characters to be ignored. The pattern can be inspected as a member .pattern of the Tokenizer function returned by page.BNF#tokenizer:

jjs> page.BNF.ctor().tokenizer().pattern
([a-zA-Z][-a-zA-Z0-9_]*)|('(?:[^'
\\]|\\['n\\])+')|(\;|\:|\%token)|(\s+)

This is daunting for BNF because the pattern for quoted literals is complicated, but a pattern for a grammar such as addition is much simpler:

([0-9]+)|(\+)|(\s+)

There is the pattern for a Number, the literal is +, and whitespace is ignored.

What should be ignored is specified as a pattern argument to tokenizer and becomes the last part of the .pattern. The default is /\s+/, i.e., to ignore whitespace. For grammars, comments extend from # to the end of the line; this is specified as /\s+|#.*/. If the pattern /.|\n/ is passed to tokenizer, the resulting function ignores all unknown characters.

Note that in these patterns the simple (grouping) parentheses are essential for the dissection process. A skip or token pattern should only employ non-capturing parentheses (?: ) as in the pattern for quoted literals for BNF.

The problem

Tokenizer

The implementation

Next: Syntax trees Previous: Grammars and BNF

Search results

`Tokenizer`