The problem
Input is usually written as a sequence of characters from which the terminal symbols have to be extracted before the input can be matched to a grammar. page takes the view that literals represent themselves and that each token has a pattern to recognize its representations in the input.
1
, 23
, or other digit sequences all would represent a Number
if the token is defined as
%token Number '[0-9]+';
If there is another token
%token Name '[a-zA-Z][-a-zA-Z0-9_]*';
sequences like Name
or more-rules
would represent a Name
.
A regular expression recognizes the longest possible character sequence;
therefore, catch22
represents a single Name
, not a Name
followed by the Number
22
.
A literal such as 'if'
would appear as if
in the input and cannot represent a Name
.
But the longer sequence iffy
represents a single Name
, not if
followed by the Name
fy
.
Tokenizer
Given a page.BNF object representing a grammar, page can create a Tokenizer, i.e., a function which accepts an input string and extracts a list of terminal symbols. We can try this with the BNF grammar object returned by page.BNF.ctor:
$ jjs -scripting page.min.js -
jjs> t = page.BNF.ctor().tokenizer()
function tokenizer (_input) {
...
}
This particular function is created from the description of BNF. Therefore it will extract the terminal symbols which are used to describe BNF, for example:
jjs> t(' %token: + %tokens ').join('\n')
(1) '%token'
(1) ':'
(1) $error '+'
(1) '%token'
(1) Name 's'
It returns an array with literals
such as %token
or a colon, and a Name
represented as the letter s
immediately following the second %token
.
Illegal input characters, if any, are reported as the special token $error
.
White space is discarded.
The result of a Tokenizer call is an array of page.Tuple elements.
Each element contains the line number where the symbol was found,
together with the classification — literal, token, or $error
— and the string representing the symbol.
The implementation
For most grammars, the token patterns are relatively simple,
and the literals are words like if
and one- or two-letter operators such as
parentheses, +
, and <=
.
Therefore, most of the time
input can be dissected using a pattern composed of
the patterns for the tokens, a pattern enumerating all literals, longer literals first,
and a pattern for input characters to be ignored.
The pattern can be inspected as a member .pattern
of the Tokenizer function
returned by page.BNF#tokenizer:
jjs> page.BNF.ctor().tokenizer().pattern
([a-zA-Z][-a-zA-Z0-9_]*)|('(?:[^'
\\]|\\['n\\])+')|(\;|\:|\%token)|(\s+)
This is daunting for BNF because the pattern for quoted literals is complicated,
but a pattern for a grammar such as
addition
is much simpler:
([0-9]+)|(\+)|(\s+)
There is the pattern for a Number
, the literal is +
, and whitespace is ignored.
What should be ignored is specified as a pattern argument to tokenizer
and becomes the last part of the .pattern
.
The default is /\s+/
, i.e., to ignore whitespace.
For grammars, comments extend from #
to the end of the line;
this is specified as /\s+|#.*/
.
If the pattern /.|\n/
is passed to tokenizer
, the resulting function ignores all unknown characters.
Note that in these patterns the simple (grouping) parentheses are essential for the dissection process.
A skip or token pattern should only employ non-capturing parentheses (?: )
as in the pattern for quoted literals for BNF.