Grammars and BNF

Terminology

For the purposes of page a BNF grammar consists of a start symbol, a nonempty list of rules, and optionally a regular expression describing input characters which will be ignored.

A rule is an ordered pair consisting of a non-terminal as left-hand side and a list of symbols which may be empty as right-hand side. A symbol is a non-terminal or a terminal; a terminal is either a literal or a token.

A non-terminal must be the left-hand side in one or more rules. The start symbol must be a non-terminal. Nonterminals are identified by unique names. The factory method page.BNF#nt returns a unique page.BNF.NT object for each unique name representing a non-terminal.

A literal is a terminal expected to appear verbatim in input to a generated parser, for example a word such as if or an operator such as +. As a terminal, a literal can only appear on the right-hand side of rules. The factory method page.BNF#lit returns a unique page.BNF.Lit object for each unique string representing a literal.

A token is a terminal designating a category of character sequences, such as a number or a variable name, which appear for the token in input to a generated parser. Tokens are identified by unique names, different from non-terminals. As a terminal, a literal can only appear on the right-hand side of rules. The factory method page.BNF#token returns a unique page.BNF.Token object for each unique name representing a token. To create the object the method requires a regular expression describing the character sequences.

Representing a BNF grammar

A BNF grammar is represented as an instance of page.BNF using the factory methods described above and page.BNF#init to assign the start symbol and the rules to the instance. For example, addition can be defined as a sequence of one or more numbers separated by plus operators:

var g = new page.BNF();
g.init(g.nt('addition'), [           // start symbol, list of rules
    g.rule(g.nt('addition'), [       // rule (1) addition: Number
        g.token('Number', '[0-9]+')  // new token: digits represent a Number
      ]),
    g.rule(g.nt('addition'), [       // rule (2) addition: Number '+' addition
        g.token('Number'),           // reference to Number token
        g.lit('+'),                  // literal: plus operator
        g.nt('addition')             // reference to non-terminal
      ])
  ]);

The factory method page.BNF#rule returns a new page.BNF.Rule object with a non-terminal and a list of symbols and adds it to the non-terminal's rules.

Describing a grammar

It will be explained later that page can convert a textual description into a grammar construction as shown above. First in the description, each token is defined as a name and a pattern. For example, a Number is a sequence of one or more digits:

%token Number '[0-9]+';

Then, beginning with the start symbol as first left-hand side, one or more rules are defined:

addition: Number;
addition: Number '+' addition;

A rule (ordered pair) starts with a name (non-terminal) and a colon, contains a sequence of quoted literals and names — each a token or non-terminal — and ends with a semicolon.

Note that all names must either have been defined as tokens or appear as left-hand side of one or more rules, but not both. Tokens must be defined before they can be used, non-terminal names can be forward references.

This grammar specifies that an addition consists of a single Number or of a Number followed by a + operator followed by another addition. Note the use of recursion to specify (infinite) sequences.

A language for grammars

The textual description suggests that there is a language, usually referred to as BNF, to describe grammars. As a language, BNF has a grammar, and this grammar can again be textually described using the BNF language. This description page.BNF.grammar is built into page as a string:

jjs> page.BNF.grammar
%token Name    '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';

Represented by strings, these patterns are a bit complicated: a Name starts with a letter and then contains minus signs, letters, digits, and underscores. A Literal is enclosed by single quotes and can contain any character as long as each quote, newline, or backslash are represented as \', \n, or \\. Note that these patterns must use (:? ) rather than ( ) for grouping.

grammar: tokens rules;
tokens: ;
tokens: token tokens;
token: '%token' Name Literal ';';

A grammar consists of tokens followed by rules. There can be zero or more token definitions in the tokens, and each consists of %token followed by a Name, a Literal with the pattern string, and a semicolon.

rules: rule more_rules;
more_rules: ;
more_rules: rule more_rules;
rule: nt ':' rhs ';';
nt: Name;

There is at least one rule in the rules and each rule consists of a nt — which is a Name — followed by a colon, a right-hand side rhs, and a semicolon.

rhs: ;
rhs: token_or_nt rhs;
rhs: lit rhs;
token_or_nt: Name;
lit: Literal;

The right-hand side consists of zero or more symbols each of which is either a Name for a token or non-terminal or a Literal.

It will be explained later that page can construct page.BNF objects from textual descriptions using page.BNF.from. This method allows commented descriptions; a comment extends from # to the end of a line.

What's in this grammar object?

To get started, the hand-crafted construction page.BNF.ctor corresponding to the description page.BNF.grammar is built into page:

jjs> page.BNF.ctor
function () {
    var g = new page.BNF();
    g.init(g.nt('grammar'),
      [ g.rule(g.nt('grammar'),     [ g.nt('tokens'), g.nt('rules') ]),
...

Every object in page can be inspected or dumped for additional information. Here we can see that this grammar object does represent the grammar for BNF:

jjs> page.BNF.ctor().toString()
 0 $accept: grammar $eof
 1 grammar: tokens rules
 2 tokens:
 3 tokens: token tokens
 4 token: '%token' Name Literal ';'
 5 rules: rule more_rules
 6 more_rules:
 7 more_rules: rule more_rules
 8 rule: nt ':' rhs ';'
 9 nt: Name
10 rhs:
11 rhs: token_or_nt rhs
12 rhs: lit rhs
13 token_or_nt: Name
14 lit: Literal

non-terminals: $accept grammar tokens rules token rule more_rules nt rhs token_or_nt lit
literals: $eof ';' ':' '%token'
tokens: $error
        Name '[a-zA-Z][-a-zA-Z0-9_]*'
        Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\''

Note that page.BNF#init adds a rule 0 to the grammar with a special non-terminal $accept as internal start symbol and a symbol sequence consisting of the original start symbol followed by a special literal $eof which represents the end of input to the generated parser. There is also a special token $error representing an input error. All of these can be accessed by calling the appropriate factory method without arguments.

Next: Dissecting input