Representing BNF
Here is one more collection of semantic actions which turn out to be essential for page itself. Very early a hand-crafted nest of calls to factory methods of page.BNF was shown which represented the ambiguous grammar
%token Number '[0-9]+';
addition: Number;
addition: Number '+' addition;
as a page.BNF object.
A similar hand-crafted collection, page.BNF.ctor, is built into page which represents
a grammar for the grammar description language BNF itself as a page.BNF object.
Fortunately, this grammar for BNF is LL(1).
Therefore, the corresponding page.BNF object can be wrapped as a page.LL1 object and used
for parsing grammars described in BNF such as the one for addition
:
parser = new page.LL1(page.BNF.ctor()) // a parser for BNF
input = parser.tokenizer()( Grammar.value ) // the grammar above...
print( parser.parser(input) ) // ...is a BNF sentence
The result is null
because there is no observer.
That the grammar for BNF is actually involved can be demonstrated either with page.trace
or with a deliberate syntax error such as inserting an additional semicolon somewhere.
The hand-crafted construction for addition
showed how this grammar can be represented as a page.BNF object.
The construction can also be accomplished with a suitable collection of Builder functions
handed to the parser
object constructed from the BNF grammar:
addition = parser.parser(tokens, parser.reflect(new page.BNF(), page.BNF.builder)) [0];
print( addition )
produces
0 $accept: addition $eof;
1 addition: Number;
2 addition: Number '+' addition;
non-terminals: $accept addition
literals: $eof '+'
tokens: $error
Number '[0-9]+'
As discussed previously, addition
is an ambiguous grammar; however, if represented as a page.BNF object,
it can still be wrapped as a page.LL1 object and used for parsing a single Number
:
a = new page.LL1(addition)
print( a.parser(a.tokenizer()( '1' ), page.trace()) )
produces
addition has ambiguous alternatives
AT TUPLE ACTION RETURN
0 (2) Number '1' rule $accept: addition $eof; null
1 (2) Number '1' rule addition: Number; null
1 (2) Number '1' shift Number null
1 (0) $eof reduce addition: Number; null
1 (0) $eof shift $eof null
0 (0) $eof reduce $accept: addition $eof; null
null
The ability to turn a textual grammar description written in BNF such as addition
into a page.BNF object clearly is important.
This feature is implemented as page.BNF.from using exactly the process
described here. The crucial Builder collection page.BNF.builder is part of page:
page.BNF.builder = {
grammar: // tokens rules
function (_values, _bnf) {
_bnf.init(null, _values);
return _bnf;
},
The functions expect a page.BNF object as environment and will call its factory methods. At the top level of the grammar the rules are handed to page.BNF#init and the page.BNF object is returned, ready to be used.
token: // '%token' Name Literal ';'
function (_values, _bnf, _parser) {
if (_values[0] in _bnf.tokensByName)
_bnf.error('('+_parser.current.lineno+')', _values[0], 'is already defined');
else
_bnf.token(_values[0], page.unescape(_values[1]));
return []; // only collect tokens in _bnf
},
A token definition is checked to report duplicates and usually the details
are handed to the factory method page.BNF#token to define the token name.
The page.BNF object collects the tokens; therefore an empty list
is returned to the reflect
observer.
The grammar rules
tokens: ;
tokens: token tokens;
rules: rule more_rules;
more_rules: ;
more_rules: rule more_rules;
need no Builder functions. Values are collected by default,
i.e., tokens
contributes nothing and rules
will deliver to grammar
above just the values which each reduce
for rule
builds:
rule: // nt ':' rhs ';'
function (_values, _bnf) {
return _bnf.rule(_values[0], _values.slice(1));
},
rule
expects a non-terminal from nt
and a list of symbols from rhs
,
uses the factory method to represent a rule, and returns the result.
nt: // Name
function (_values, _bnf, _parser) {
if (_values[0] in _bnf.tokensByName) {
_bnf.error('('+_parser.current.lineno+')', _values[0], 'is already defined as a token');
_values[0] += '$'; // kludge
}
return _bnf.nt(_values[0]);
},
nt
expects a name which should not have been defined as a token,
uses the factory method to represent a non-terminal, and returns the result.
This function demonstrates a problem with this style of programming:
If the result is not a non-terminal, rule
would still use it as such
as an argument for the factory method.
Practically all methods of page defend strongly against illegal arguments
and further processing would be aborted.
Parsing defends pretty well against an unexpected number of arguments,
but the semantic actions need to make sure that semantic aspects such as typing are correct.
The grammar rules
rhs: ;
rhs: token_or_nt rhs;
rhs: lit rhs;
need no Builder functions. Values are collected by default,
i.e., rhs
ends up with a list of objects produced by lit
and token_or_nt
.
token_or_nt: // Name
function (_values, _bnf) {
if (_values[0] in _bnf.tokensByName)
return _bnf.token(_values[0]);
return _bnf.nt(_values[0]);
},
token_or_nt
needs to make sure that a name is represented as a token,
if it has been defined as such previously, or as a non-terminal otherwise.
page.BNF#init will eventually verify that all non-terminals have been defined.
lit: // Literal
function (_values, _bnf) {
return _bnf.lit(page.unescape(_values[0]));
}
};
Literal
is a token for the BNF grammar, i.e., lit
collects the input value
and uses the factory method page.BNF#lit to represent the literal in the new grammar.
page.unescape converts between the external representation of the literal in the input
(surrounded by quotes, with backslash escapes) and the internal value (no quotes, backslashes acted upon).
The need for such a conversion is typical, the actual implementation is language-specific.
BNF uses the typical conventions of JavaScript and other languages
and they are implemented in page.unescape.
The overall architecture of page.BNF.builder is typical: token values usually have to be handed to Builder functions immediately to represent them as needed and to ensure that they have the appropriate types. E.g., in a language with different types of numbers the proper types might have to be attached. The middle level of the grammar collects values into various structures, e.g., individual rules for BNF. The start symbol, finally, is responsible for final semantic checks, e.g., page.BNF#init for BNF grammars, and for presentation as the result of parsing.
Representing EBNF
page.EBNF.from takes a grammar description in extended BNF and represents it as a page.EBNF object. This could have been implemented based on a grammar of extended BNF written in BNF but as a second proof of concept for the bootstrap process there is the hand-crafted construction page.EBNF.ctor which represents the grammar for extended BNF as a page.EBNF object and the Builder collection page.EBNF.builder which calls the factory methods of page.EBNF as appropriate.
The process is the same as in the preceding section. Here is the grammar for sums discussed earlier:
%token Number '[0-9]+';
sum: Number (add | sub)*;
add: '+' Number;
sub: '-' Number;
This time the hand-crafted page.EBNF object for the grammar for extended BNF needs to be wrapped
and used as a parser for the sum
grammar above:
parser = new page.LL1(page.EBNF.ctor()) // a parser for EBNF
input = parser.tokenizer()( Grammar.value ) // the grammar above...
print( parser.parser(input) ) // ...is a EBNF sentence
Again, the result is null
because the grammar for sum
is an EBNF sentence. To turn the description
into a page.EBNF object we need Builder functions corresponding to the EBNF parser.
sum = parser.parser(input, parser.reflect(new page.EBNF(), page.EBNF.builder)) [0]
print( sum )
produces
0 $accept: sum $eof;
1 sum: Number (add | sub)*;
2 add: '+' Number;
3 sub: '-' Number;
non-terminals: $accept sum add sub
literals: $eof '-' '+'
tokens: $error
Number '[0-9]+'
The grammar for sum
is not ambiguous. It can be wrapped into an page.LL1 object
s = new page.LL1(sum) // a parser for sums
input = s.tokenizer()( Input.value ) // the input in the Input area...
print( s.parser(input, page.trace(/reduce/)) ) // ...is a sum sentence
and can be used to parse input such as
1 + 3 - 5 + 7
which results in the following trace of reduce
actions:
AT TUPLE ACTION RETURN
2 (2) '-' reduce add: '+' Number; null
3 (2) '+' reduce sub: '-' Number; null
2 (0) $eof reduce add: '+' Number; null
1 (0) $eof reduce sum: Number (add | sub) null
0 (0) $eof reduce $accept: sum $eof; null
null
The collection of Builder functions for the EBNF grammar for EBNF, page.EBNF.builder,
is also part of page.
It turns out that the collection can use some of the same functions as page.BNF.builder;
therefore, reflect
supports something like inheritance for builders:
page.EBNF.builder = {
'{}': page.BNF.builder,
If a Builder collection contains the special name {}
it should refer to another Builder collection.
If a name cannot be found in the first collection, the search continues along the chain created with {}
.
Here, the functions for the following grammar rules are inherited:
grammar: tokens rules;
token: '%token' Name Literal ';';
nt: Name;
token_or_nt: Name;
The environment for page.EBNF.builder is a page.EBNF object, i.e., the builder will use
the overridden factory method page.EBNF#nt rather than page.BNF#nt.
This is a bit more complicated for rules which have different arguments and a different structure for page.EBNF#rule,
i.e., the rule
function cannot be inherited:
rule: // rule: nt ':' alt ';';
function (_values, _ebnf, _parser) {
if (!_values[0].rule)
return _ebnf.rule(_values[0], _values[1]);
_ebnf.error('('+_parser.current.lineno+') '+_values[0]+' can only have one rule');
return [];
},
The implementation of suffixes is similar to one implementation of arithmetic shown earlier:
opt: // opt: '?';
function (_values, _ebnf) {
return function (term) { return _ebnf.opt(term); }
},
opt
and the other suffix rules create functions which will call the proper factory method.
item: // term ( opt | some | many )?
function (_values, _ebnf) {
if (_values.length == 1)
return _values[0];
return _values[1](_values[0]);
},
item
applies the function created by the suffix. This example of a Builder function shows that a rule may well
encounter a variable number of values and take different actions depending on how many values were actually collected.
In the absence of errors parsing will ensure that the collected values correspond to the tokens and non-terminals
in the respective rule (where a non-terminal can add any number of values to the collection, however).
It is up to the Builder function to handle the values.
Wirth Syntax Notation
There are many different notations similar to the extended BNF language introduced here.
The earliest, Wirth Syntax Notation,
was proposed by Niklaus Wirth in 1977. He used brackets to enclose an optional phrase
and braces to enclose something that could occur zero and more times.
His notation requires only a small change to term
and can be described in extended BNF as follows:
%token Name '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';
grammar: token* rule+;
token: '%token' Name Literal ';';
rule: nt ':' alt ';';
nt: Name;
alt: seq +/ '|';
seq: term+;
term: opt | many | '(' alt ')' | token_or_nt | lit;
opt: '[' alt ']';
many: '{' alt '}';
token_or_nt: Name;
lit: Literal;
With the bootstrap process described in the previous section a page.EBNF object can be constructed for this grammar:
parser = new page.LL1(page.EBNF.ctor()) // a parser for EBNF
input = parser.tokenizer()( Grammar.value ) // the grammar above
wsn = parser.parser(input, parser.reflect(new page.EBNF(), page.EBNF.builder)) [0]
The grammar is LL(1) and can be wrapped into a page.LL1 object and can then parse grammars described in Wirth Syntax Notation:
w = new page.LL1(wsn) // a parser for WSN
winput = w.tokenizer()( Input.value ) // the input below...
w.parser(winput) // ...can be checked if it is a WSN sentence
In particular, Wirth Syntax Notation can be described in Wirth Syntax Notation.
This just means to replace the iteration constructs in grammar
, alt
, and seq
:
%token Name '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';
grammar: { token } rule { rule };
token: '%token' Name Literal ';';
rule: nt ':' alt ';';
nt: Name;
alt: seq { '|' seq };
seq: term { term };
term: opt | many | '(' alt ')' | token_or_nt | lit;
opt: '[' alt ']';
many: '{' alt '}';
token_or_nt: Name;
lit: Literal;
Wirth Syntax Notation cannot directly express one or more iterations; therefore, some duplication of non-terminals is required.
The grammar is quite close to the original grammar for extended BNF. It is tempting to create a Builder collection to represent grammars desribed in Wirth Syntax Notation as page.EBNF objects:
wsnActions = {
'{}': page.EBNF.builder,
opt: // '[' alt ']'
function (_values, _ebnf) { return _ebnf.opt(_values[0]); },
many: // '{' alt '}'
function (_values, _ebnf) { return _ebnf.many(_values[0]); },
}
Most semantic actions can be inherited, only opt
and many
need to be overridden
to use the factory methods page.EBNF#opt and page.EBNF#many directly.
As a result the parser w
for Wirth Syntax Notation can be observed to build a page.EBNF object
for the grammar winput
above which describes Wirth Syntax Notation in Wirth Syntax Notation
w2 = w.parser(winput, w.reflect(new page.EBNF(), wsnActions))[0]
and this object in turn can be wrapped to work with its own description
new page.LL1(w2).parser(winput)
What is the point of this exercise? It shows that page can support a new grammar description language,
namely Wirth Syntax Notation. Extended BNF has disappeared, only the classes remain which provide
the underlying functionality — representing a grammar, creating first
and follow
, checking for LL(1), parsing, and observing.
Wirth Syntax Notation was easy to implement because it has less functionality than extended BNF. On the other hand, there could be more classes, e.g., to support a notation for an exact number of iterations, or for a range or a maximum:
four-items 4
range-in-list 2:4/ ','
up-to :10
Counted iterations are awkward because they have to be enumerated for BNF, but the heavy lifting could be done by page.
BNF is all you need
page supports BNF and extended BNF and has some algorithms, most notably LL(1) checking and parsing, specifically implemented for extended BNF to demonstrate algorithm principles and provide better error messages. This section shows a way to support extended BNF that would not require page.EBNF or any of the related classes.
Parsing extended BNF requires a grammar which this time has to be written in BNF:
%token Name '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';
grammar: tokens rules;
tokens: ;
tokens: token tokens;
token: '%token' Name Literal ';';
rules: rule more_rules;
more_rules: ;
more_rules: rule more_rules;
rule: nt ':' alt ';';
nt: Name;
alt: seq more_alt;
more_alt: ;
more_alt: '|' seq more_alt;
seq: item more_seq;
more_seq: ;
more_seq: item more_seq;
item: term suffix;
suffix: ;
suffix: opt;
suffix: some;
suffix: many;
opt: '?';
some: '+' list;
many: '*' list;
list: ;
list: '/' term;
term: token_or_nt;
term: lit;
term: '(' alt ')';
token_or_nt: Name;
lit: Literal;
Repetitions such as token*
, rules+
, seq +/ '|'
, and item+
have to be expressed using additional non-terminals,
alternatives such as opt | some | many
have to be specified as multiple right-hand sides.
All of this is taken care of when page.EBNF#init translates an EBNF grammar into BNF
and is documented in a table in the method documentation.
The grammar above can be represented as a page.BNF object:
bnf = new page.LL1(page.BNF.ctor()) // a parser for BNF
input = bnf.tokenizer()( Grammar.value ) // the grammar for EBNF in BNF above...
e = bnf.parser(input, bnf.reflect(new page.BNF(), page.BNF.builder))[0]
This produces exactly the BNF rules above:
0 $accept: grammar $eof;
1 grammar: tokens rules;
2 tokens: ;
3 tokens: token tokens;
4 token: '%token' Name Literal ';';
...
The grammar is LL(1) and can be wrapped into page.LL1 and used for parsing a grammar written in extended BNF,
for example the sum
grammar discussed earlier:
%token Number '[0-9]+';
sum: num (add | sub)*;
add: '+' num;
sub: '-' num;
num: Number;
The BNF object can use the page.EBNF.builder collection of semantic actions to represent the sum
grammar
as a page.EBNF object:
ebnf = new page.LL1(e) // a parser for EBNF represented in BNF
input = ebnf.tokenizer()( Input.value ) // the grammar in the Input area
ebnf.parser(input, ebnf.reflect(new page.EBNF(), page.EBNF.builder))[0]
This produces
0 $accept: sum $eof;
1 sum: num (add | sub)*;
2 add: '+' num;
3 sub: '-' num;
4 num: Number;
No surprise: these are the EBNF rules. To make do with BNF alone requires a collection of Builder functions which correspond to the grammar for extended BNF but use the factory methods of page.BNF, i.e., represent each extended BNF construct immediately with BNF classes:
BNFbuilder = {
'{}': page.BNF.builder, // inherited
rule: // nt ':' alt ';'
function (_values, _bnf) {
var nt = _values[0];
for (var n = 1; n < _values.length; ++ n)
_bnf.rules.push(_bnf.rule(nt, [ _values[n] ]));
return nt;
},
Very many semantic actions can rely on the default of collecting values,
and a few can be inherited from page.BNF.builder.
A rule
can have a choice construct on the right-hand side,
which has to be translated into several BNF rules for the non-terminal.
The semantic actions for tokens are inherited and collect no values.
The rule
returns the non-terminal so that eventually the semantic action for grammar
can specify
the start symbol (non-terminal of the first rule) when the page.BNF object is completed:
grammar: // tokens rules
function (_values, _bnf) {
_bnf.init(_values[0]);
return _bnf;
},
Choices and sequences need a more or less anonymous non-terminal
to create rules for. Such a non-terminal can be created with the page.BNF#nt factory method
and '$'
as a special argument:
alt:
function (_values, _bnf) {
if (_values.length == 1) return _values[0];
var nt = _bnf.nt('$');
for (var n = 0; n < _values.length; ++ n)
_bnf.rules.push(_bnf.rule(nt, [_values[n]]));
return nt;
},
seq:
function (_values, _bnf) {
if (_values.length == 1) return _values[0];
var nt = _bnf.nt('$');
_bnf.rules.push(_bnf.rule(nt, _values));
return nt;
},
The semantic action for item
may have to consider a suffix.
Actions for the various suffixes are expected to return functions which can
process the term collected for item
:
item: // term suffix
function (_values, _bnf) {
return _values.length == 1 ? _values[0] : _values[1](_values[0]);
},
An optional term requires an anonymous non-terminal which has an empty right-hand side and a right-hand side with the term:
opt: // '?'
function (_values, _bnf) {
return function (term) {
var nt = _bnf.nt('$');
_bnf.rules.push(_bnf.rule(nt, [ ]));
_bnf.rules.push(_bnf.rule(nt, [ term ]));
return nt;
};
},
The returned function which item
calls creates two rules.
The iterations are more complicated because they might have to account for delimiters:
many: // '*' list
function (_values, _bnf) {
if (_values.length == 0)
return function (term) {
var nt = _bnf.nt('$');
_bnf.rules.push(_bnf.rule(nt, [ ]));
_bnf.rules.push(_bnf.rule(nt, [ term, nt ]));
return nt;
};
If there is no delimiter, many
returns a function which translates term*
into
the right-recursive rules
$#: ;
$#: term $#;
using another unique, anonymous non-terminal, here called $#
.
If a delimiter value has been collected for many
, it takes four rules and two anonymous non-terminals
to translate
term */ delim
to BNF:
$#: ;
$#: term $##;
$##: ;
$##: delim term $##;
Here is the function which many
will return for a delimited list:
else
return function (term) {
var nt = _bnf.nt('$'), nt2 = _bnf.nt('$');
_bnf.rules.push(_bnf.rule(nt, [ ]));
_bnf.rules.push(_bnf.rule(nt, [ term, nt2 ]));
_bnf.rules.push(_bnf.rule(nt2, [ ]));
_bnf.rules.push(_bnf.rule(nt2, [ _values[0], term, nt2 ]));
return nt;
};
}
item
will execute this function and return the first non-terminal.
Finally, term+
takes three right-recursive rules and two new non-terminals:
$#: term $##;
$##: ;
$##: term $##;
and term +/ delim
takes almost the same rules:
$#: term $##;
$##: ;
$##: delim term $##
Here is the semantic action for some
which implements this translation:
some: // '+' list
function (_values, _bnf) {
if (_values.length == 0)
return function (term) {
var nt = _bnf.nt('$'), nt2 = _bnf.nt('$');
_bnf.rules.push(_bnf.rule(nt, [ term, nt2 ]));
_bnf.rules.push(_bnf.rule(nt2, [ ]));
_bnf.rules.push(_bnf.rule(nt2, [ term, nt2 ]));
return nt;
};
else
return function (term) {
var nt = _bnf.nt('$'), nt2 = _bnf.nt('$');
_bnf.rules.push(_bnf.rule(nt, [ term, nt2 ]));
_bnf.rules.push(_bnf.rule(nt2, [ ]));
_bnf.rules.push(_bnf.rule(nt2, [ _values[0], term, nt2 ]));
return nt;
};
},
};
The parser ebnf
created from the BNF representation of the grammar for extended BNF
can use this collection of semantic actions to represent the sum
grammar (which is LL(1))
as a BNF object which can be wrapped into page.LL1 and used to parse a sequence of numbers and operators:
sum = ebnf.parser(input, ebnf.reflect(new page.BNF(), BNFbuilder))[0]
s = new page.LL1(sum) // a parser for sums
input = s.tokenizer()('1 - 3 + 5 - 7') // a sequence of numbers and operators
s.parser(input, page.trace(s.reflect(), /reduce/)) // ...can be parsed
This produces
AT TUPLE ACTION RETURN
11 (1) '-' reduce num: Number; [ ] 1
11 (1) '+' reduce num: Number; [ ] 3
9 (1) '+' reduce $9: '-' num; [ ] 3
10 (1) '+' reduce sub: $9; [ ] 3
2 (1) '+' reduce $5: sub; [ ] 3
11 (1) '-' reduce num: Number; [ ] 5
7 (1) '-' reduce $8: '+' num; [ ] 5
8 (1) '-' reduce add: $8; [ ] 5
1 (1) '-' reduce $5: add; [ ] 5
11 (0) $eof reduce num: Number; [ ] 7
9 (0) $eof reduce $9: '-' num; [ ] 7
10 (0) $eof reduce sub: $9; [ ] 7
2 (0) $eof reduce $5: sub; [ ] 7
4 (0) $eof reduce $6: $5 $6; [ ] 7
4 (0) $eof reduce $6: $5 $6; [ ] 5,7
4 (0) $eof reduce $6: $5 $6; [ ] 3,5,7
5 (0) $eof reduce $7: num $6; [ ] 1,3,5,7
6 (0) $eof reduce sum: $7; [ ] 1,3,5,7
0 (0) $eof reduce $accept: sum $eof; [ ] 1,3,5,7
1,3,5,7
Mission accomplished: sum
and ebnf
do not involve anything from the page.EBNF realm.
All it takes is BNF!
Just for the fun of it, the BNF-based EBNF parser ebnf
can take page.EBNF.grammar,
the grammar for extended BNF described in extended BNF,
use the Builder actions in BNFbuilder
above to represent the grammar as a BNF object
which can then be wrapped into page.LL1 and used to parse itself:
input = ebnf.tokenizer()(page.EBNF.grammar) // the EBNF grammar in EBNF...
// ...built for BNF with the EBNF parser...
e2 = ebnf.parser(input, ebnf.reflect(new page.BNF(), BNFbuilder))[0]
new page.LL1(e2).parser(input) // ...can parse itself
This returns null
, but we could again observe it with BNFbuilder
actions and repeat a few more times...