The problem
Input is usually written as a sequence of characters from which the terminal symbols have to be extracted before the input can be matched to a grammar. page takes the view that literals represent themselves and that each token has a pattern to recognize its representations in the input.
1, 23, or other digit sequences would all represent a Number if the token is defined as

%token Number '[0-9]+';

If there is another token

%token Name '[a-zA-Z][-a-zA-Z0-9_]*';

sequences like Name or more-rules would represent a Name.
A regular expression recognizes the longest possible character sequence;
therefore, catch22 represents a single Name, not a Name followed by the Number 22.
A literal such as 'if' would appear as if in the input and cannot represent a Name.
But the longer sequence iffy represents a single Name, not if followed by the Name fy.
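The longest-match rule can be checked directly with a plain JavaScript regular expression; this is only a sketch, not part of page, using the Name pattern from above:

```javascript
// the token pattern for Name, as defined above
const name = /[a-zA-Z][-a-zA-Z0-9_]*/;

// return the first (and, being greedy, longest possible) match
function firstMatch(re, input) {
  const m = re.exec(input);
  return m && m[0];
}

console.log(firstMatch(name, 'catch22'));  // catch22 -- one Name
console.log(firstMatch(name, 'iffy'));     // iffy -- one Name, not if + fy
```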
Tokenizer
Given a page.BNF object representing a grammar, page can create a Tokenizer, i.e., a function which accepts an input string and extracts a list of terminal symbols. We can try this with the BNF grammar object returned by page.BNF.ctor:
$ jjs -scripting page.min.js -
jjs> t = page.BNF.ctor().tokenizer()
function tokenizer (_input) {
...
}

This particular function is created from the description of BNF. Therefore it will extract the terminal symbols which are used to describe BNF, for example:
jjs> t(' %token: + %tokens ').join('\n')
(1) '%token'
(1) ':'
(1) $error '+'
(1) '%token'
(1) Name 's'
It returns an array with literals
such as %token or a colon, and a Name represented as the letter s
immediately following the second %token.
Illegal input characters, if any, are reported as the special token $error.
White space is discarded.
The result of a Tokenizer call is an array of page.Tuple elements.
Each element contains the line number where the symbol was found,
together with the classification — literal, token, or $error — and the string representing the symbol.
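The names below are hypothetical stand-ins — the actual accessors of page.Tuple may differ; the sketch only shows the shape of the result and how one might consume it:

```javascript
// hypothetical stand-in for a page.Tuple element: line number,
// classification, and the string representing the symbol
function tuple(lineno, cls, text) {
  return { lineno, cls, text };
}

// the result for ' %token: + %tokens ' from above, as such stand-ins
const tokens = [
  tuple(1, 'literal', '%token'),
  tuple(1, 'literal', ':'),
  tuple(1, '$error', '+'),
  tuple(1, 'literal', '%token'),
  tuple(1, 'Name', 's'),
];

// the line numbers make error reporting straightforward
const errors = tokens.filter(t => t.cls === '$error');
errors.forEach(t => console.log(`(${t.lineno}) illegal input '${t.text}'`));
```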
The implementation
For most grammars, the token patterns are relatively simple,
and the literals are words like if and one- or two-character operators such as
parentheses, +, and <=.
Therefore, most of the time
input can be dissected using a pattern composed of
the patterns for the tokens, a pattern enumerating all literals, longer literals first,
and a pattern for input characters to be ignored.
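As a sketch of this dissection idea (not page's actual implementation), here is how such a combined pattern for the addition grammar can drive a tokenizer in plain JavaScript: the capturing group that is defined in a match identifies the symbol's class, and gaps between matches become $error tokens:

```javascript
// sketch only: dissect input with ([0-9]+)|(\+)|(\s+)
// group 1: Number token, group 2: the literal +, group 3: ignored
function tokenize(input) {
  const pattern = /([0-9]+)|(\+)|(\s+)/g;
  const out = [];
  let last = 0, m;
  while ((m = pattern.exec(input)) !== null) {
    if (m.index > last)                         // gap: illegal characters
      out.push(['$error', input.slice(last, m.index)]);
    if (m[1] !== undefined) out.push(['Number', m[1]]);
    else if (m[2] !== undefined) out.push(["'+'", m[2]]);
    // m[3]: white space, discarded
    last = pattern.lastIndex;
  }
  if (last < input.length)                      // trailing illegal characters
    out.push(['$error', input.slice(last)]);
  return out;
}

console.log(tokenize('1 + 23'));
// [ ['Number','1'], ["'+'",'+'], ['Number','23'] ]
```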
The pattern can be inspected as a member .pattern of the Tokenizer function
returned by page.BNF#tokenizer:
jjs> page.BNF.ctor().tokenizer().pattern
([a-zA-Z][-a-zA-Z0-9_]*)|('(?:[^'
\\]|\\['n\\])+')|(\;|\:|\%token)|(\s+)
This is daunting for BNF because the pattern for quoted literals is complicated,
but a pattern for a grammar such as addition is much simpler:
([0-9]+)|(\+)|(\s+)

There is the pattern for a Number, the single literal +, and white space to be ignored.
What should be ignored is specified as a pattern argument to tokenizer
and becomes the last part of the .pattern.
The default is /\s+/, i.e., to ignore whitespace.
For grammars, comments extend from # to the end of the line;
this is specified as /\s+|#.*/.
If the pattern /.|\n/ is passed to tokenizer, the resulting function ignores all unknown characters.
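With the grammar skip pattern /\s+|#.*/ appended as the last alternative, comments vanish along with white space. Again a sketch, not the real tokenizer:

```javascript
// combined pattern for addition, with /\s+|#.*/ as the skip alternative
const pattern = /([0-9]+)|(\+)|(\s+|#.*)/g;

const input = '1 + 2 # add them';
const kept = [];
let m;
while ((m = pattern.exec(input)) !== null)
  if (m[3] === undefined) kept.push(m[0]);   // drop skipped text
console.log(kept);                           // [ '1', '+', '2' ]
```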
Note that in these patterns the capturing (grouping) parentheses are essential for the dissection process;
a skip or token pattern itself should only employ non-capturing parentheses (?: ), as in the pattern for quoted literals for BNF.
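To see why, here is a sketch in plain JavaScript; the two-alternative pattern is a simplified stand-in for the real one:

```javascript
// simplified excerpt: quoted literal | white space.
// Dissection decides a match's class by which capturing group is defined,
// so the top-level groups must keep their numbers 1, 2, 3, ...
const good = /('(?:[^'\n\\]|\\['n\\])+')|(\s+)/;
let m = good.exec("'a'");
console.log(m[1], m[2]);   // 'a' undefined -- group 1: a quoted literal

// with a capturing inner group instead of (?: ), the numbering shifts:
// white space moves from group 2 to group 3 and the mapping breaks
const bad = /('([^'\n\\]|\\['n\\])+')|(\s+)/;
m = bad.exec("'a'");
console.log(m[2]);         // a -- the stray inner group, not white space
```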