Another way to describe grammars
Grammars for programming languages contain a lot of choices and iterations. Both are awkward to describe using BNF: each choice requires a non-terminal to join the alternatives together and each iteration requires a non-terminal for recursion.
Consider addition
as defined previously:
a sequence of one or more numbers separated by plus operators.
Extended BNF adds ways to describe choices and iterations to BNF
which cut down on the need for non-terminals. For example:
addition: Number ( '+' Number )*;
addition
is a Number
followed by zero or more groups of +
and a Number
.
Parentheses are used for grouping and nesting.
A suffix applied to a symbol or group indicates repetition:
the suffix *
means zero or more times,
+
means one or more times,
and ?
means that the symbol or group is optional (it can be specified or omitted).
One can view a subtraction
as a list of one or more Number
values separated by -
operators:
subtraction: Number +/ '-';
Two infix operations are used to describe lists:
+/
indicates a list of one or more elements and
*/
indicates a list which can be empty.
The left operand, a symbol or group, describes each list element;
the right operand, another symbol or group, describes the delimiter between any two list elements.
All of these can be combined in a new grammar which uses calculations
as a start symbol.
calculations
consist of one or more addition
or subtraction
:
%token Number '[0-9]+';
calculations: ( addition | subtraction )+;
addition: Number ( '+' Number )*;
subtraction: Number +/ '-';
The infix operator |
combines choices.
The parentheses are necessary in the first rule.
Precedence increases from choice |
in extended BNF
to juxtaposing in BNF,
and from there to the lists +/
and */
and the suffixes +
, *
, and ?
of extended BNF.
Representing an EBNF grammar
An EBNF grammar is represented as an instance of page.EBNF using the factory methods page.BNF#lit, and page.BNF#token discussed previously.
The representations of non-terminals and rules are slightly different because a non-terminal must be defined by a single rule. A rule has this kind of non-terminal as a left-hand side and a single eNode object as a right-hand side. Two factory methods, page.EBNF#nt and page.EBNF#rule, are overrides to produce the modified objects. page.EBNF#init is an override to assign the start symbol and rules to the instance. New factory methods support the new notations:
page.EBNF#alt corresponds to the operator
|
and returns a new page.EBNF.Alt object which represents a set of choices.page.EBNF#seq corresponds to parentheses and returns a new page.EBNF.Seq object which represents a group of items.
page.EBNF#opt corresponds to the suffix
?
and returns a new page.EBNF.Rep object which represents optional use of an item.page.EBNF#many corresponds to the suffix
*
and returns a new page.EBNF.Rep object which represents zero or more uses of an item.page.EBNF#some corresponds to the suffix
+
and returns a new page.EBNF.Rep object which represents one or more uses of an item.page.EBNF#manyList corresponds to the operator
*/
and returns a new page.EBNF.Lst object which represents a list of zero or more uses of an item separated by a delimiter.page.EBNF#someList corresponds to the operator
+/
and returns a new page.EBNF.Lst object which represents a list of one or more uses of an item separated by a delimiter.
For example, here is the construction for the calculations
:
var g = new page.EBNF();
g.init(g.nt('calculations'), [ // start symbol, list of rules
g.rule(g.nt('calculations'), // calculations: ( addition | subtraction )+;
g.some(
g.alt([ g.nt('addition'), g.nt('subtraction') ])
)),
g.rule(g.nt('addition'), // addition: Number ( '+' Number )*;
g.seq([
g.token('Number', '[0-9]+'), // token: digits for a Number
g.many( // sequence of zero or more of...
g.seq([ g.lit('+'), g.token('Number')
]))
])),
g.rule(g.nt('subtraction'), // subtraction: Number +/ '-';
g.someList( // list of one or more...
g.token('Number'), // list item
g.lit('-') // delimiter
))
]);
Another selfie
The language extended BNF is used just like BNF to describe grammars and, therefore, has a grammar which again can be textually represented using either one of the BNF or extended BNF languages. The representation of extended BNF in extended BNF, page.EBNF.grammar, is built into page as a string:
$ jjs -scripting page.min.js -
jjs> page.EBNF.grammar
%token Name '[a-zA-Z][-a-zA-Z0-9_]*';
%token Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\'';
grammar: token* rule+;
token: '%token' Name Literal ';';
Patterns and the requirement to place token definitions before the grammar rules remain unchanged, but the new suffixes are used to represent the sequence of tokens and non-empty sequence of rules.
rule: nt ':' alt ';';
nt: Name;
The right-hand side of a rule now consists of choices and for extended BNF there should be only one rule for each non-terminal.
alt: seq +/ '|';
The list of choices is separated by |
and cannot be empty.
Each alternative consists of one ore more, optionally suffixed terms:
seq: item+;
item: term ( opt | some | many )?;
opt: '?';
some: '+' ( '/' term )?;
many: '*' ( '/' term )?;
The suffixes are ?
, +
, and *
, and either of the last two can optionally be specified with a term as a delimiter.
Note that the last two rules permit white space, etc., between +
or *
and /
.
term: token_or_nt | lit | '(' alt ')';
token_or_nt: Name;
lit: Literal;
As in BNF a term
can be a token, non-terminal name or literal,
but in extended BNF a term can also be anything — enclosed in parentheses — which can appear on the right-hand side of a rule.
What's in that grammar object?
The construction page.EBNF.ctor for extended BNF is built into page, too:
jjs> page.EBNF.ctor
function () {
var g = new page.EBNF();
g.init(g.nt('grammar'),
[ g.rule(g.nt('grammar'), g.seq([ g.many(g.nt('token')), g.some(g.nt('rule')) ])),
...
jjs> page.EBNF.ctor().toString()
0 $accept: grammar $eof
1 grammar: token* rule+
2 token: '%token' Name Literal ';'
3 rule: nt ':' alt ';'
4 nt: Name
5 alt: seq +/ '|'
6 seq: item+
7 item: term (opt | some | many)?
8 opt: '?'
9 some: '+' ('/' term)?
10 many: '*' ('/' term)?
11 term: token_or_nt | lit | '(' alt ')'
12 token_or_nt: Name
13 lit: Literal
non-terminals: $accept grammar token rule nt alt seq item term opt some many token_or_nt lit
literals: $eof '|' '?' ';' ':' '/' '+' '*' ')' '(' '%token'
tokens: $error
Name '[a-zA-Z][-a-zA-Z0-9_]*'
Literal '\'(?:[^\'\n\\\\]|\\\\[\'n\\\\])+\''
Extended BNF is only a shorthand for BNF. Once a page.EBNF object is initialized it converts its grammar to BNF:
jjs> page.EBNF.ctor().bnf.toString()
0 $accept: grammar $eof;
1 grammar: $2 $4;
2 $2: ;
3 $2: token $2;
4 $4: rule $6;
5 $6: ;
6 $6: rule $6;
7 token: '%token' Name Literal ';';
...
Terminals and the names of the non-terminals are always shared between the EBNF and BNF representations.
Rule 0 is the same for BNF and extended BNF. Rules 1 and 2 are now rules 1 through 6 and 7,
and three new non-terminals and five new rules are required
to implement the suffix *
for token
(new rules 2 and 3)
and the suffix +
for rule
(new rules 4 through 6).
Left or right recursion is used for the expansion
depending whether or not an additional argument for page.EBNF#init is true.
Initial EBNF grammar checks
When an EBNF grammar construction is concluded with page.EBNF#init, the EBNF grammar is translated and represented as a BNF grammar. At this point page.BNF#init performs the usual checks for the BNF grammar. Therefore:
All names in either grammar must be uniquely defined, either as token names using
%token
, or as non-terminal names which must appear on the left-hand side of single EBNF rules.All non-terminals in either grammar should be necessary, i.e., they must be reachable through a chain of right-hand sides from the internal start symbol
$accept
.Syntax trees based on EBNF grammars would be relatively flat but we can resort to the BNF translation to produce deeper trees. In either case, non-terminals should still have a way to produce terminals:
a: 'ok' | b; b: c; c: b; error: b is infinitely recursive error: c is infinitely recursive
Note that the error messages stem from the BNF grammar,
i.e., they might refer to some synthetic non-terminals such as $2
.