page Tutorial: Bootstrap

Representing BNF

Here is one more collection of semantic actions which turn out to be essential for page itself. Very early a hand-crafted nest of calls to factory methods of page.BNF was shown which represented the ambiguous grammar

%token Number '[0-9]+';

addition: Number;
addition: Number '+' addition;

as a page.BNF object.

A similar hand-crafted collection, page.BNF.ctor, is built into page which represents a grammar for the grammar description language BNF itself as a page.BNF object. Fortunately, this grammar for BNF is LL(1). Therefore, the corresponding page.BNF object can be wrapped as a page.LL1 object and used for parsing grammars described in BNF such as the one for addition:

parser = new page.LL1(page.BNF.ctor())       // a parser for BNF
input = parser.tokenizer()( Grammar.value )  // the grammar above...
print( parser.parser(input) )                // ...is a BNF sentence

The result is null because there is no observer. That the grammar for BNF is actually involved can be demonstrated either with page.trace or with a deliberate syntax error such as inserting an additional semicolon somewhere.

The hand-crafted construction for addition showed how this grammar can be represented as a page.BNF object. The construction can also be accomplished with a suitable collection of Builder functions handed to the parser object constructed from the BNF grammar:

addition = parser.parser(tokens, parser.reflect(new page.BNF(), page.BNF.builder)) [0];
print( addition )

produces

 0 $accept: addition $eof;
 1 addition: Number;
 2 addition: Number '+' addition;

non-terminals: $accept addition
literals: $eof '+'
tokens: $error 
        Number '[0-9]+'

As discussed previously, addition is an ambiguous grammar; however, if represented as a page.BNF object, it can still be wrapped as a page.LL1 object and used for parsing a single Number:

a = new page.LL1(addition)
print( a.parser(a.tokenizer()( '1' ), page.trace()) )

produces

addition has ambiguous alternatives

AT  TUPLE            ACTION                          RETURN
0   (2) Number '1'   rule   $accept: addition $eof;  null
1   (2) Number '1'   rule   addition: Number;        null
1   (2) Number '1'   shift  Number                   null
1   (0) $eof         reduce addition: Number;        null
1   (0) $eof         shift  $eof                     null
0   (0) $eof         reduce $accept: addition $eof;  null
null

The ability to turn a textual grammar description written in BNF such as addition into a page.BNF object clearly is important. This feature is implemented as page.BNF.from using exactly the process described here. The crucial Builder collection page.BNF.builder is part of page:

page.BNF.builder = {
  grammar: // tokens rules
    function (_values, _bnf) {
      _bnf.init(null, _values);
      return _bnf;
    },

The functions expect a page.BNF object as environment and will call its factory methods. At the top level of the grammar the rules are handed to page.BNF#init and the page.BNF object is returned, ready to be used.

  token: // '%token' Name Literal ';'
    function (_values, _bnf, _parser) {
      if (_values[0] in _bnf.tokensByName)
        _bnf.error('('+_parser.current.lineno+')', _values[0], 'is already defined');
      else
        _bnf.token(_values[0], page.unescape(_values[1]));
      return []; // only collect tokens in _bnf
    },

A token definition is checked to report duplicates and usually the details are handed to the factory method page.BNF#token to define the token name. The page.BNF object collects the tokens; therefore an empty list is returned to the reflect observer.

The grammar rules

tokens: ;
tokens: token tokens;
rules: rule more_rules;
more_rules: ;
more_rules: rule more_rules;

need no Builder functions. Values are collected by default, i.e., tokens contributes nothing and rules will deliver to grammar above just the values which each reduce for rule builds:

  rule: // nt ':' rhs ';'
    function (_values, _bnf) {
      return _bnf.rule(_values[0], _values.slice(1));
    },

rule expects a non-terminal from nt and a list of symbols from rhs, uses the factory method to represent a rule, and returns the result.

  nt: // Name
    function (_values, _bnf, _parser) {
      if (_values[0] in _bnf.tokensByName) {
        _bnf.error('('+_parser.current.lineno+')', _values[0], 'is already defined as a token');
        _values[0] += '$'; // kludge
      }
      return _bnf.nt(_values[0]);
    },

nt expects a name which should not have been defined as a token, uses the factory method to represent a non-terminal, and returns the result.

This function demonstrates a problem with this style of programming: If the result is not a non-terminal, rule would still use it as such as an argument for the factory method. Practically all methods of page defend strongly against illegal arguments and further processing would be aborted. Parsing defends pretty well against an unexpected number of arguments, but the semantic actions need to make sure that semantic aspects such as typing are correct.

The grammar rules

rhs: ;
rhs: token_or_nt rhs;
rhs: lit rhs;

need no Builder functions. Values are collected by default, i.e., rhs ends up with a list of objects produced by lit and token_or_nt.

  token_or_nt: // Name
    function (_values, _bnf) {
      if (_values[0] in _bnf.tokensByName)
        return _bnf.token(_values[0]);
      return _bnf.nt(_values[0]);
    },

token_or_nt needs to make sure that a name is represented as a token, if it has been defined as such previously, or as a non-terminal otherwise. page.BNF#init will eventually verify that all non-terminals have been defined.

  lit: // Literal
    function (_values, _bnf) {
      return _bnf.lit(page.unescape(_values[0]));
    }
};

Literal is a token for the BNF grammar, i.e., lit collects the input value and uses the factory method page.BNF#lit to represent the literal in the new grammar. page.unescape converts between the external representation of the literal in the input (surrounded by quotes, with backslash escapes) and the internal value (no quotes, backslashes acted upon). The need for such a conversion is typical, the actual implementation is language-specific. BNF uses the typical conventions of JavaScript and other languages and they are implemented in page.unescape.

The overall architecture of page.BNF.builder is typical: token values usually have to be handed to Builder functions immediately to represent them as needed and to ensure that they have the appropriate types. E.g., in a language with different types of numbers the proper types might have to be attached. The middle level of the grammar collects values into various structures, e.g., individual rules for BNF. The start symbol, finally, is responsible for final semantic checks, e.g., page.BNF#init for BNF grammars, and for presentation as the result of parsing.

Representing EBNF

page.EBNF.from takes a grammar description in extended BNF and represents it as a page.EBNF object. This could have been implemented based on a grammar of extended BNF written in BNF but as a second proof of concept for the bootstrap process there is the hand-crafted construction page.EBNF.ctor which represents the grammar for extended BNF as a page.EBNF object and the Builder collection page.EBNF.builder which calls the factory methods of page.EBNF as appropriate.

The process is the same as in the preceding section. Here is the grammar for sums discussed earlier:

%token Number '[0-9]+';

sum: Number (add | sub)*;
add: '+' Number;
sub: '-' Number;

This time the hand-crafted page.EBNF object for the grammar for extended BNF needs to be wrapped and used as a parser for the sum grammar above:

parser = new page.LL1(page.EBNF.ctor())      // a parser for EBNF
input = parser.tokenizer()( Grammar.value )  // the grammar above...
print( parser.parser(input) )                // ...is a EBNF sentence

Again, the result is null because the grammar for sum is an EBNF sentence. To turn the description into a page.EBNF object we need Builder functions corresponding to the EBNF parser.

sum = parser.parser(input, parser.reflect(new page.EBNF(), page.EBNF.builder)) [0]
print( sum )

produces

 0 $accept: sum $eof;
 1 sum: Number (add | sub)*;
 2 add: '+' Number;
 3 sub: '-' Number;

non-terminals: $accept sum add sub
literals: $eof '-' '+'
tokens: $error 
        Number '[0-9]+'

The grammar for sum is not ambiguous. It can be wrapped into an page.LL1 object

s = new page.LL1(sum)                           // a parser for sums
input = s.tokenizer()( Input.value )            // the input in the Input area...
print( s.parser(input, page.trace(/reduce/)) )  // ...is a sum sentence

and can be used to parse input such as

1 + 3 - 5 + 7

which results in the following trace of reduce actions:

AT  TUPLE            ACTION                          RETURN
2   (2) '-'          reduce add: '+' Number;         null
3   (2) '+'          reduce sub: '-' Number;         null
2   (0) $eof         reduce add: '+' Number;         null
1   (0) $eof         reduce sum: Number (add | sub)  null
0   (0) $eof         reduce $accept: sum $eof;       null
null

The collection of Builder functions for the EBNF grammar for EBNF, page.EBNF.builder, is also part of page. It turns out that the collection can use some of the same functions as page.BNF.builder; therefore, reflect supports something like inheritance for builders:

page.EBNF.builder = {
  '{}': page.BNF.builder,

If a Builder collection contains the special name {} it should refer to another Builder collection. If a name cannot be found in the first collection, the search continues along the chain created with {}. Here, the functions for the following grammar rules are inherited:

grammar: tokens rules;
token: '%token' Name Literal ';';
nt: Name;
token_or_nt: Name;

The environment for page.EBNF.builder is a page.EBNF object, i.e., the builder will use the overridden factory method page.EBNF#nt rather than page.BNF#nt. This is a bit more complicated for rules which have different arguments and a different structure for page.EBNF#rule, i.e., the rule function cannot be inherited:

  rule:                      // rule: nt ':' alt ';';
    function (_values, _ebnf, _parser) {
      if (!_values[0].rule)
        return _ebnf.rule(_values[0], _values[1]);
      _ebnf.error('('+_parser.current.lineno+') '+_values[0]+' can only have one rule');
      return [];
    },

The implementation of suffixes is similar to one implementation of arithmetic shown earlier:

  opt:                       // opt: '?';
    function (_values, _ebnf) {
      return function (term) { return _ebnf.opt(term); }
    },

opt and the other suffix rules create functions which will call the proper factory method.

  item: // term ( opt | some | many )?
    function (_values, _ebnf) {
      if (_values.length == 1)
        return _values[0];
      return _values[1](_values[0]);
    },

item applies the function created by the suffix. This example of a Builder function shows that a rule may well encounter a variable number of values and take different actions depending on how many values were actually collected. In the absence of errors parsing will ensure that the collected values correspond to the tokens and non-terminals in the respective rule (where a non-terminal can add any number of values to the collection, however). It is up to the Builder function to handle the values.

Wirth Syntax Notation

There are many different notations similar to the extended BNF language introduced here. The earliest, Wirth Syntax Notation, was proposed by Niklaus Wirth in 1977. He used brackets to enclose an optional phrase and braces to enclose something that could occur zero and more times. His notation requires only a small change to term and can be described in extended BNF as follows:

%token Name    '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';  

grammar: token* rule+;
token: '%token' Name Literal ';';
rule: nt ':' alt ';';
nt: Name;
alt: seq +/ '|';
seq: term+;

term: opt | many | '(' alt ')' | token_or_nt | lit;
opt: '[' alt ']';
many: '{' alt '}';

token_or_nt: Name;
lit: Literal;

With the bootstrap process described in the previous section a page.EBNF object can be constructed for this grammar:

parser = new page.LL1(page.EBNF.ctor())      // a parser for EBNF
input = parser.tokenizer()( Grammar.value )  // the grammar above
wsn = parser.parser(input, parser.reflect(new page.EBNF(), page.EBNF.builder)) [0]

The grammar is LL(1) and can be wrapped into a page.LL1 object and can then parse grammars described in Wirth Syntax Notation:

w = new page.LL1(wsn)                  // a parser for WSN
winput = w.tokenizer()( Input.value )  // the input below...
w.parser(winput)     // ...can be checked if it is a WSN sentence

In particular, Wirth Syntax Notation can be described in Wirth Syntax Notation. This just means to replace the iteration constructs in grammar, alt, and seq:

%token Name    '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';  

grammar: { token } rule { rule };

token: '%token' Name Literal ';';
rule: nt ':' alt ';';
nt: Name;

alt: seq { '|' seq };
seq: term { term };

term: opt | many | '(' alt ')' | token_or_nt | lit;
opt: '[' alt ']';
many: '{' alt '}';
token_or_nt: Name;
lit: Literal;

Wirth Syntax Notation cannot directly express one or more iterations; therefore, some duplication of non-terminals is required.

The grammar is quite close to the original grammar for extended BNF. It is tempting to create a Builder collection to represent grammars desribed in Wirth Syntax Notation as page.EBNF objects:

wsnActions = {
  '{}': page.EBNF.builder,

  opt: // '[' alt ']'
    function (_values, _ebnf) { return _ebnf.opt(_values[0]);  },

  many: // '{' alt '}'
    function (_values, _ebnf) { return _ebnf.many(_values[0]);  },
}

Most semantic actions can be inherited, only opt and many need to be overridden to use the factory methods page.EBNF#opt and page.EBNF#many directly. As a result the parser w for Wirth Syntax Notation can be observed to build a page.EBNF object for the grammar winput above which describes Wirth Syntax Notation in Wirth Syntax Notation

w2 = w.parser(winput, w.reflect(new page.EBNF(), wsnActions))[0]

and this object in turn can be wrapped to work with its own description

new page.LL1(w2).parser(winput)

What is the point of this exercise? It shows that page can support a new grammar description language, namely Wirth Syntax Notation. Extended BNF has disappeared, only the classes remain which provide the underlying functionality — representing a grammar, creating first and follow, checking for LL(1), parsing, and observing.

Wirth Syntax Notation was easy to implement because it has less functionality than extended BNF. On the other hand, there could be more classes, e.g., to support a notation for an exact number of iterations, or for a range or a maximum:

four-items 4
range-in-list 2:4/ ','
up-to :10

Counted iterations are awkward because they have to be enumerated for BNF, but the heavy lifting could be done by page.

BNF is all you need

page supports BNF and extended BNF and has some algorithms, most notably LL(1) checking and parsing, specifically implemented for extended BNF to demonstrate algorithm principles and provide better error messages. This section shows a way to support extended BNF that would not require page.EBNF or any of the related classes.

Parsing extended BNF requires a grammar which this time has to be written in BNF:

%token Name    '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';

grammar: tokens rules;

tokens: ;
tokens: token tokens;

token: '%token' Name Literal ';';

rules: rule more_rules;
more_rules: ;
more_rules: rule more_rules;

rule: nt ':' alt ';';

nt: Name;

alt: seq more_alt;
more_alt: ;
more_alt: '|' seq more_alt;

seq: item more_seq;
more_seq: ;
more_seq: item more_seq;

item: term suffix;

suffix: ;
suffix: opt;
suffix: some;
suffix: many;

opt: '?';
some: '+' list;
many: '*' list;

list: ;
list: '/' term;

term: token_or_nt;
term: lit;
term: '(' alt ')';

token_or_nt: Name;
lit: Literal;

Repetitions such as token*, rules+, seq +/ '|', and item+ have to be expressed using additional non-terminals, alternatives such as opt | some | many have to be specified as multiple right-hand sides. All of this is taken care of when page.EBNF#init translates an EBNF grammar into BNF and is documented in a table in the method documentation.

The grammar above can be represented as a page.BNF object:

bnf = new page.LL1(page.BNF.ctor())       // a parser for BNF
input = bnf.tokenizer()( Grammar.value )  // the grammar for EBNF in BNF above...
e = bnf.parser(input, bnf.reflect(new page.BNF(), page.BNF.builder))[0]

This produces exactly the BNF rules above:

 0 $accept: grammar $eof;
 1 grammar: tokens rules;
 2 tokens: ;
 3 tokens: token tokens;
 4 token: '%token' Name Literal ';';
...

The grammar is LL(1) and can be wrapped into page.LL1 and used for parsing a grammar written in extended BNF, for example the sum grammar discussed earlier:

%token Number '[0-9]+';

sum: num (add | sub)*;
add: '+' num;
sub: '-' num;
num: Number;

The BNF object can use the page.EBNF.builder collection of semantic actions to represent the sum grammar as a page.EBNF object:

ebnf = new page.LL1(e)                    // a parser for EBNF represented in BNF
input = ebnf.tokenizer()( Input.value )   // the grammar in the Input area
ebnf.parser(input, ebnf.reflect(new page.EBNF(), page.EBNF.builder))[0]

This produces

 0 $accept: sum $eof;
 1 sum: num (add | sub)*;
 2 add: '+' num;
 3 sub: '-' num;
 4 num: Number;

No surprise: these are the EBNF rules. To make do with BNF alone requires a collection of Builder functions which correspond to the grammar for extended BNF but use the factory methods of page.BNF, i.e., represent each extended BNF construct immediately with BNF classes:

BNFbuilder = {
  '{}': page.BNF.builder, // inherited

  rule: // nt ':' alt ';'
    function (_values, _bnf) {
      var nt = _values[0];
      for (var n = 1; n < _values.length; ++ n)
        _bnf.rules.push(_bnf.rule(nt, [ _values[n] ]));
      return nt;
    },

Very many semantic actions can rely on the default of collecting values, and a few can be inherited from page.BNF.builder. A rule can have a choice construct on the right-hand side, which has to be translated into several BNF rules for the non-terminal.

The semantic actions for tokens are inherited and collect no values. The rule returns the non-terminal so that eventually the semantic action for grammar can specify the start symbol (non-terminal of the first rule) when the page.BNF object is completed:

  grammar: // tokens rules
    function (_values, _bnf) {
      _bnf.init(_values[0]);
      return _bnf;
    },

Choices and sequences need a more or less anonymous non-terminal to create rules for. Such a non-terminal can be created with the page.BNF#nt factory method and '$' as a special argument:

  alt:
    function (_values, _bnf) {
      if (_values.length == 1) return _values[0];
      var nt = _bnf.nt('$');
      for (var n = 0; n < _values.length; ++ n)
        _bnf.rules.push(_bnf.rule(nt, [_values[n]]));
      return nt;
    },

  seq:
    function (_values, _bnf) {
      if (_values.length == 1) return _values[0];
      var nt = _bnf.nt('$');
      _bnf.rules.push(_bnf.rule(nt, _values));
      return nt;
    },

The semantic action for item may have to consider a suffix. Actions for the various suffixes are expected to return functions which can process the term collected for item:

  item: // term suffix
    function (_values, _bnf) {
      return _values.length == 1 ? _values[0] : _values[1](_values[0]);
    },

An optional term requires an anonymous non-terminal which has an empty right-hand side and a right-hand side with the term:

  opt: // '?'
    function (_values, _bnf) {
      return function (term) {
        var nt = _bnf.nt('$');
        _bnf.rules.push(_bnf.rule(nt, [ ]));
        _bnf.rules.push(_bnf.rule(nt, [ term ]));
        return nt;
      };
    },

The returned function which item calls creates two rules.

The iterations are more complicated because they might have to account for delimiters:

  many: // '*' list
    function (_values, _bnf) {
      if (_values.length == 0)
        return function (term) {
          var nt = _bnf.nt('$');
          _bnf.rules.push(_bnf.rule(nt, [ ]));
          _bnf.rules.push(_bnf.rule(nt, [ term, nt ]));
          return nt;
        };

If there is no delimiter, many returns a function which translates term* into the right-recursive rules

$#: ;
$#: term $#;

using another unique, anonymous non-terminal, here called $#.

If a delimiter value has been collected for many, it takes four rules and two anonymous non-terminals to translate

term */ delim

to BNF:

$#: ;
$#: term $##;
$##: ;
$##: delim term $##;

Here is the function which many will return for a delimited list:

      else
        return function (term) {
          var nt = _bnf.nt('$'), nt2 = _bnf.nt('$');
          _bnf.rules.push(_bnf.rule(nt, [ ]));
          _bnf.rules.push(_bnf.rule(nt, [ term, nt2 ]));
          _bnf.rules.push(_bnf.rule(nt2, [ ]));
          _bnf.rules.push(_bnf.rule(nt2, [ _values[0], term, nt2 ]));
          return nt;
        };
    }

item will execute this function and return the first non-terminal.

Finally, term+ takes three right-recursive rules and two new non-terminals:

$#: term $##;
$##: ;
$##: term $##;

and term +/ delim takes almost the same rules:

$#: term $##;
$##: ;
$##: delim term $##

Here is the semantic action for some which implements this translation:

  some: // '+' list
    function (_values, _bnf) {
      if (_values.length == 0)
        return function (term) {
          var nt = _bnf.nt('$'), nt2 = _bnf.nt('$');
          _bnf.rules.push(_bnf.rule(nt, [ term, nt2 ]));
          _bnf.rules.push(_bnf.rule(nt2, [ ]));
          _bnf.rules.push(_bnf.rule(nt2, [ term, nt2 ]));
          return nt;
        };
      else
        return function (term) {
          var nt = _bnf.nt('$'), nt2 = _bnf.nt('$');
          _bnf.rules.push(_bnf.rule(nt, [ term, nt2 ]));
          _bnf.rules.push(_bnf.rule(nt2, [ ]));
          _bnf.rules.push(_bnf.rule(nt2, [ _values[0], term, nt2 ]));
          return nt;
        };
    },
};

The parser ebnf created from the BNF representation of the grammar for extended BNF can use this collection of semantic actions to represent the sum grammar (which is LL(1)) as a BNF object which can be wrapped into page.LL1 and used to parse a sequence of numbers and operators:

sum = ebnf.parser(input, ebnf.reflect(new page.BNF(), BNFbuilder))[0]
s = new page.LL1(sum)                     // a parser for sums
input = s.tokenizer()('1 - 3 + 5 - 7')    // a sequence of numbers and operators
s.parser(input, page.trace(s.reflect(), /reduce/))  // ...can be parsed

This produces

AT  TUPLE            ACTION                          RETURN
11  (1) '-'          reduce num: Number;             [ ] 1
11  (1) '+'          reduce num: Number;             [ ] 3
9   (1) '+'          reduce $9: '-' num;             [ ] 3
10  (1) '+'          reduce sub: $9;                 [ ] 3
2   (1) '+'          reduce $5: sub;                 [ ] 3
11  (1) '-'          reduce num: Number;             [ ] 5
7   (1) '-'          reduce $8: '+' num;             [ ] 5
8   (1) '-'          reduce add: $8;                 [ ] 5
1   (1) '-'          reduce $5: add;                 [ ] 5
11  (0) $eof         reduce num: Number;             [ ] 7
9   (0) $eof         reduce $9: '-' num;             [ ] 7
10  (0) $eof         reduce sub: $9;                 [ ] 7
2   (0) $eof         reduce $5: sub;                 [ ] 7
4   (0) $eof         reduce $6: $5 $6;               [ ] 7
4   (0) $eof         reduce $6: $5 $6;               [ ] 5,7
4   (0) $eof         reduce $6: $5 $6;               [ ] 3,5,7
5   (0) $eof         reduce $7: num $6;              [ ] 1,3,5,7
6   (0) $eof         reduce sum: $7;                 [ ] 1,3,5,7
0   (0) $eof         reduce $accept: sum $eof;       [ ] 1,3,5,7
1,3,5,7

Mission accomplished: sum and ebnf do not involve anything from the page.EBNF realm. All it takes is BNF!

Just for the fun of it, the BNF-based EBNF parser ebnf can take page.EBNF.grammar, the grammar for extended BNF described in extended BNF, use the Builder actions in BNFbuilder above to represent the grammar as a BNF object which can then be wrapped into page.LL1 and used to parse itself:

input = ebnf.tokenizer()(page.EBNF.grammar)  // the EBNF grammar in EBNF...
                                // ...built for BNF with the EBNF parser...
e2 = ebnf.parser(input, ebnf.reflect(new page.BNF(), BNFbuilder))[0]
new page.LL1(e2).parser(input)  // ...can parse itself

This returns null, but we could again observe it with BNFbuilder actions and repeat a few more times...

Representing BNF

Representing EBNF

Wirth Syntax Notation

BNF is all you need

Next: LR(1) Previous: Observers

Search results