Terminology
For the purposes of page a BNF grammar consists of a start symbol, a nonempty list of rules, and optionally a regular expression describing input characters which will be ignored.
A rule is an ordered pair consisting of a non-terminal as left-hand side and a list of symbols which may be empty as right-hand side. A symbol is a non-terminal or a terminal; a terminal is either a literal or a token.
A non-terminal must be the left-hand side in one or more rules. The start symbol must be a non-terminal. Nonterminals are identified by unique names. The factory method page.BNF#nt returns a unique page.BNF.NT object for each unique name representing a non-terminal.
A literal is a terminal expected to appear verbatim in input to a generated parser,
for example a word such as if
or an operator such as +
.
As a terminal, a literal can only appear on the right-hand side of rules.
The factory method page.BNF#lit returns a unique page.BNF.Lit object for each unique string representing a literal.
A token is a terminal designating a category of character sequences, such as a number or a variable name, which appear for the token in input to a generated parser. Tokens are identified by unique names, different from non-terminals. As a terminal, a literal can only appear on the right-hand side of rules. The factory method page.BNF#token returns a unique page.BNF.Token object for each unique name representing a token. To create the object the method requires a regular expression describing the character sequences.
Representing a BNF grammar
A BNF grammar is represented as an instance of page.BNF using the factory methods described above and page.BNF#init to assign the start symbol and the rules to the instance. For example, addition can be defined as a sequence of one or more numbers separated by plus operators:
var g = new page.BNF();
g.init(g.nt('addition'), [ // start symbol, list of rules
g.rule(g.nt('addition'), [ // rule (1) addition: Number
g.token('Number', '[0-9]+') // new token: digits represent a Number
]),
g.rule(g.nt('addition'), [ // rule (2) addition: Number '+' addition
g.token('Number'), // reference to Number token
g.lit('+'), // literal: plus operator
g.nt('addition') // reference to non-terminal
])
]);
The factory method page.BNF#rule returns a new page.BNF.Rule object with a non-terminal and a list of symbols and adds it to the non-terminal's rules.
Describing a grammar
It will be explained later
that page can convert a textual description into a grammar construction as shown above.
First in the description, each token is defined as a name and a pattern. For example, a Number
is a sequence of one or more digits:
%token Number '[0-9]+';
Then, beginning with the start symbol as first left-hand side, one or more rules are defined:
addition: Number;
addition: Number '+' addition;
A rule (ordered pair) starts with a name (non-terminal) and a colon, contains a sequence of quoted literals and names — each a token or non-terminal — and ends with a semicolon.
Note that all names must either have been defined as tokens or appear as left-hand side of one or more rules, but not both. Tokens must be defined before they can be used, non-terminal names can be forward references.
This grammar specifies that an addition
consists of a single Number
or of a Number
followed by a +
operator followed by another addition
.
Note the use of recursion to specify (infinite) sequences.
A language for grammars
The textual description suggests that there is a language, usually referred to as BNF, to describe grammars. As a language, BNF has a grammar, and this grammar can again be textually described using the BNF language. This description page.BNF.grammar is built into page as a string:
jjs> page.BNF.grammar
%token Name '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';
Represented by strings, these patterns are a bit complicated:
a Name
starts with a letter and then contains minus signs, letters, digits, and underscores.
A Literal
is enclosed by single quotes and can contain any character as long as
each quote, newline, or backslash are represented as \'
, \n
, or \\
.
Note that these patterns must use (:? )
rather than ( )
for grouping.
grammar: tokens rules;
tokens: ;
tokens: token tokens;
token: '%token' Name Literal ';';
A grammar
consists of tokens
followed by rules
.
There can be zero or more token
definitions in the tokens
,
and each consists of %token
followed by a Name
, a Literal
with the pattern string, and a semicolon.
rules: rule more_rules;
more_rules: ;
more_rules: rule more_rules;
rule: nt ':' rhs ';';
nt: Name;
There is at least one rule
in the rules
and each rule
consists of a nt
— which is a Name
—
followed by a colon, a right-hand side rhs
, and a semicolon.
rhs: ;
rhs: token_or_nt rhs;
rhs: lit rhs;
token_or_nt: Name;
lit: Literal;
The right-hand side consists of zero or more symbols
each of which is either a Name
for a token or non-terminal or a Literal
.
It will be explained later
that page can construct page.BNF objects from textual descriptions
using page.BNF.from.
This method allows commented descriptions;
a comment extends from #
to the end of a line.
What's in this grammar object?
To get started, the hand-crafted construction page.BNF.ctor corresponding to the description page.BNF.grammar is built into page:
jjs> page.BNF.ctor
function () {
var g = new page.BNF();
g.init(g.nt('grammar'),
[ g.rule(g.nt('grammar'), [ g.nt('tokens'), g.nt('rules') ]),
...
Every object in page can be inspected or dumped for additional information. Here we can see that this grammar object does represent the grammar for BNF:
jjs> page.BNF.ctor().toString()
0 $accept: grammar $eof
1 grammar: tokens rules
2 tokens:
3 tokens: token tokens
4 token: '%token' Name Literal ';'
5 rules: rule more_rules
6 more_rules:
7 more_rules: rule more_rules
8 rule: nt ':' rhs ';'
9 nt: Name
10 rhs:
11 rhs: token_or_nt rhs
12 rhs: lit rhs
13 token_or_nt: Name
14 lit: Literal
non-terminals: $accept grammar tokens rules token rule more_rules nt rhs token_or_nt lit
literals: $eof ';' ':' '%token'
tokens: $error
Name '[a-zA-Z][-a-zA-Z0-9_]*'
Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\''
Note that page.BNF#init adds a rule 0 to the grammar
with a special non-terminal $accept
as internal start symbol
and a symbol sequence consisting of the original start symbol followed by a special literal $eof
which represents the end of input to the generated parser.
There is also a special token $error
representing an input error.
All of these can be accessed by calling the appropriate factory method without arguments.