Recognition
page.SLR1 is used just like page.LL1:
a grammar is represented as an object using a factory method such as page.BNF.from
and wrapped into a new page.SLR1 object to check if the grammar is suitable for LR(1) parsing.
If so, page.SLR1#parse takes input and an optional observer function
and returns true
to request more input, or null
, or the last value from the observer.
Input is represented as a sequence of page.Tuple objects, or a function delivering such a sequence, just as described previously. Therefore, input can be pushed to page.SLR1#parse or pulled by the method as described before.
page.trace creates a suitable observer function and page.SLR1#dump will show the current stack. As an example, here is how the grammar
root: 'a';
root: 'a' 'b';
root: root 'c';
can be used to recognize a
:
// create parser and tokenizer
s = new page.SLR1(page.BNF.from( Grammar.value ));
t = s.tokenizer();
// observe recognition and display the active stack
s.parse(t('a'), page.trace());
The trace is short:
AT TUPLE ACTION RETURN
0 (1) 'a' shift 1 null
Note that a parser based on LR(1) cannot send a message at the start of a rule because it selects the rule at the very moment it is reduced.
Actually, this explains the names:
LL(1) stands for left-to-right parsing with leftmost derivation and one symbol lookahead, i.e., rules must be selected a priori, top-down.
LR(1) stands for left-to-right parsing with rightmost derivation and one symbol lookahead, i.e., rules are selected as late as possible, only when a configuration is complete, bottom-up.
s.dump()
displays the active stack (and more):
active stack:
0
1
Another push of input with s.parse(t('bcc'))
results in more trace output
1 (2) 'b' shift 3 null
3 (2) 'c' reduce root: 'a' 'b'; null
0 (2) 'c' goto 2 null
2 (2) 'c' shift 4 null
4 (2) 'c' reduce root: root 'c'; null
0 (2) 'c' goto 2 null
2 (2) 'c' shift 4 null
and s.dump()
shows how the stack has changed:
active stack:
0
2
4
Finally, s.parse(null)
concludes parsing and the trace
4 (0) $eof reduce root: root 'c'; null
0 (0) $eof goto 2 null
2 (0) $eof accept null
and returns the result null
.
The trace shows that the parser sends shift
, reduce
, and goto
messages to the observer,
accompanied by target state or rule information,
and finally an accept
message.
The first column of the trace contains the current state number.
Implementation
page.SLR1 owns the state stack so that input can be pushed. On the first call page.SLR1#parse initializes the stack:
if (!('stack' in this))
this.stack = [ { state: 0 } ];
For every call the current state object and input tuple are set up:
var state = this.states[ this.stack[this.stack.length - 1].state ];
var current = null; // current input, none yet available
The parser enters an infinite loop which will be terminated
by a need for another input sequence, an accept
action, or an Error
:
try {
while (true) {
// get an input tuple, if needed
if (!current) next();
// deal with the input
...
}
} catch (outcome) {
// request another input sequence?
if (outcome === true) return true;
// end parsing: remove stack
delete this.stack;
// really bad error?
if (outcome instanceof Error) throw outcome;
return outcome[0]; // from the observer, if any
}
page.SLR1#parse has a few local functions for abstraction.
try
and catch
are used so that, e.g., next()
can transfer input tuples
from an array or from successive function calls, one by one, to current
and throw true
to request another input sequence.
perform()
implements the various actions, communicates with the observer, if any,
and returns true
if current
is consumed.
This function throws an array with the last value
returned by the observer, if any, once accept
is reached.
Finally, recover()
implements error recovery and might throw an Error
if things go really wrong.
Given these functions dealing with input is easy:
// deal with the input
if (current.t != this.bnf.token() // not $error tuple
&& current.t.ord in state.actions) { // anticipated input
if (perform(state.actions[current.t.ord], current))
next(); // consumed
} else
recover();
An $error
tuple represents input which the tokenizer did not understand.
It, as well as an input which is not a key in the action table of the current state,
will have to be reported by recover
before error recovery is attempted.
perform
receives the action and the corresponding input and uses another local function observe()
to communicate with the observer, if any, and ensure that error messages get printed and counted.
An observer can throw a string to be printed as an error message or an Error
to abort parsing.
function perform (_action, _tuple) {
var result = observe(_tuple, _action.action, _action.info);
accept
and shift
are very simple to implement.
accept
terminates parsing and returns the result from the observer, if any.
A shift
action has the new state number as info
property.
switch (_action.action) {
case 'accept':
throw [ result ]; // success
case 'shift': // shift to new state
this.stack.push({ state: _action.info, value: result });
state = this.states[_action.info];
return true; // _tuple consumed
As discussed before, an observer will collect values and present them to semantic actions. The collection must be performed in parallel to the state changes so that the values collected for a rule can be identified. Therefore, each state change saves the value returned by the observer on the stack.
A reduce
action references the rule to be reduced so that the resulting non-terminal and the rule length can be retrieved.
reduce
is more complicated because it is always followed by a goto
action, also with a new state number.
The reduce
action produces a value from the observer, if any,
and during the state change in goto
the value is saved on the stack.
case 'reduce':
// pop the stack by the length of the rule, uncover state
this.stack.length -= _action.info.symbols.length;
state = this.states[this.stack[this.stack.length - 1].state];
// there has to be a goto for the rule's non-terminal
var g = state.actions[_action.info.nt.ord];
At this point g
must be a goto
action which is sent to the observer — mostly for tracing — and the new state
and the value returned by the observer, if any, for reduce
are pushed onto the stack:
observe(_tuple, g.action, g.info); // observe the goto
// goto to new state
this.stack.push({ state: g.info, value: result });
state = this.states[g.info];
return false; // _tuple still available
}
}
Collecting values
page.SLR1#parse has the current parser state on top of a stack.
New states are reached through shift
and goto
actions and pushed onto the stack.
For each state the stack holds an associated value which is either null
or
returned by an observer for the shift
message or for the reduce
message preceding the goto
message, respectively.
An observer is responsible for a semantic action when a rule is reduced.
The observer receives a reference to the rule as part of the reduce
message.
Therefore, the observer can access the exact number of stack entries which contain the associated values.
page.SLR1#yaccpar returns an observer function which is closely patterned after the
yacc parser generator.
For a shift
message this observer will return the string representing the terminal, literal or token, in the input.
for a reduce
message, without a Builder function, the observer will return the first collected value
or null
if the rule is empty. As an example, for the grammar
root: 'a';
root: 'a' 'b';
root: root 'c';
and the input
a b c c
the following code
s = new page.SLR1(page.BNF.from( Grammar.value ));
t = s.tokenizer();
o = page.trace(s.yaccpar());
s.parse(t( Input.value ).concat(null), o);
results in this trace:
AT TUPLE ACTION RETURN
0 (2) 'a' shift 1 str a
1 (2) 'b' shift 3 str b
3 (2) 'c' reduce root: 'a' 'b'; str a
0 (2) 'c' goto 2 null
2 (2) 'c' shift 4 str c
4 (2) 'c' reduce root: root 'c'; str a
0 (2) 'c' goto 2 null
2 (2) 'c' shift 4 str c
4 (0) $eof reduce root: root 'c'; str a
0 (0) $eof goto 2 null
2 (0) $eof accept str a
a
I.e., all the literals are collected but page.SLR1#yaccpar retains only the very first one across reduce
messages.
This changes completely if a semantic action for root
is added:
actions = {
root: function (values) { return '[' + values.join(', ') + ']'; }
};
o = page.trace(s.yaccpar(null, actions), /reduce/);
s.parse(t( Input.value ).concat(null), o);
Now the Builder function accumulates the literals and previous results and builds something close to a syntax tree:
AT TUPLE ACTION RETURN
3 (2) 'c' reduce root: 'a' 'b'; str [a, b]
4 (2) 'c' reduce root: root 'c'; str [[a, b], c]
4 (0) $eof reduce root: root 'c'; str [[[a, b], c], c]
[[[a, b], c], c]
Semantic actions
page.SLR1#reflect returns an observer which does not collect literal values.
For a reduce
message, the collected non-null
values are concatenated into an array,
i.e., null
and empty arrays are ignored and other arrays are flattened.
If there is no Builder function with the name of the non-terminal for the rule to be reduced,
the values are returned by the observer, i.e., they are collected for the next outer reduce
message.
As an example, the following grammar
%token Name '[a-z]+';
count: null empty Name double;
null: ;
empty: ;
double: Name;
can be used to parse the input
a b
and without semantic actions the observer created by page.SLR1#reflect will collect an array containing a
and b
.
With the actions
actions = {
'null': function (values) { return null; },
'double': function (values) { values.push(values[0]); return values; },
count: function (values) { return values.length; }
};
the observer will return the count 3
, i.e., the input strings a
and two b
.
This count would also result if null
were collected and the result of double
not flattened;
however, two push
operations in the double
action result in a count of 4, disproving this theory.
Adding the action
empty: function (values) { return []; },
shows that an empty array or an empty rule does not contribute a value. Inserting a string or a number into the empty array will increase the count.
The grammar is both, LR(1) and LL(1).
If the same input is processed using
page.LL1#parser and page.LL1#reflect with the same actions,
the count remains unaffected by whatever the actions for null
and empty
return
because the parsing algorithm does not enter empty rules:
AT TUPLE ACTION RETURN
0 (2) Name 'a' rule $accept: count $eof; null
1 (2) Name 'a' rule count: null empty Name null
1 (2) Name 'a' shift Name null
4 (2) Name 'b' rule double: Name; null
4 (2) Name 'b' shift Name null
4 (0) $eof reduce double: Name; [ ] b,b
1 (0) $eof reduce count: null empty Name num 3
0 (0) $eof shift $eof null
0 (0) $eof reduce $accept: count $eof; [ ] 3
3
The trace shows that empty rules are not reduced by the LL(1) parser.
page.SLR1#reflect works just like page.LL1#reflect, i.e., all the examples from Observers can be run by changing from LL(1) to SLR(1).
Error recovery
page.SLR1#parse encounters an error when the current input terminal is not a key in the current state's action table.
It was explained above that the local function recover()
has to deal with this situation.
recover()
first composes a message indicating all input symbols which would be acceptable in the current state:
function recover () {
var msg = current + ' is not allowed\nexpecting:';
for (var a in state.actions)
if (state.actions[a].symbol instanceof page.BNF.T
&& state.actions[a].symbol != this.bnf.token()) // not $error
msg += ' ' + state.actions[a].symbol;
observe(current, 'error', msg);
observe()
gives the observer, if any, a chance to abort parsing and prints the message otherwise.
Next, recover()
discards all input tuples containing the $error
token:
while (current.t == this.bnf.token())
next();
These tuples are produced by the tokenizer discussed previously for illegal input characters.
At this point next()
might have to request more input to be pushed to the parser.
This aborts recovery and might result in another error message.
It is time to discard input, states, or both, until input is expected by a state.
Like it's namesake in one of the LL(1) implementations
this recover()
, too, could traverse backwards over the state stack to find anticipated inputs and act on a match.
However, in order to give a grammar designer some control over the recovery process,
page implements the error
token name which yacc also uses.
error
is a predefined, reserved token name which can be used in a grammar like any other token name,
but it does not match any actual input; in particular, it does not match $error
when sent by the tokenizer.
Instead, if an actual input is not expected in a current state,
recover()
discards states until it finds a state with a shift
action for error
:
do {
while (this.bnf.token().ord in state.actions) {
// have action for error
if (perform(state.actions[this.bnf.token().ord],
new page.Tuple(current.lineno, this.bnf.token(), null))) {
// consumed error tuple, i.e., did shift
...
}
// else did reduce/goto, loop to check this state for error
}
// no action for error, pop stack
this.stack.pop();
state = this.states[this.stack[this.stack.length - 1].state];
} while (this.stack.length > 1);
throw [ 'irrecoverable error' ];
}
If there is any action for error
it is carried out.
If it is a reduce
action, it is followed by goto
and the target state is also checked for an action for error
.
If all the states are discarded without a shift
action for error,
recover()
aborts parsing.
The shift
action for error
marks the end of discarding states.
recover()
will stay in the target state which involves configurations where the marker has made it across error
.
At this point, recover()
will discard input until it is expected in the state:
// consumed error tuple, i.e., did shift
while (true) {
while (!(current.t.ord in state.actions)) {
this.message('discarding ' + current);
next();
}
Once the current input is expected, it's action is performed:
if (perform(state.actions[current.t.ord], current)) {
next(); // did shift current
return; // ready for another tuple
}
}
If perform()
returns true
, the action was a shift
and recover()
considers it's mission accomplished.
Otherwise there was a reduce
action followed by goto
to a new state which is at least likely to expect the current input.
Therefore, reduce()
loops back, checks for the action, and if need be discards some more input.
error
If there is no error
symbol in a grammar, page.SLR1#parse will abort parsing
as soon as unexpected input is found. As an example, if the grammar
root: 'a' 'b';
is used to parse
a c c c b
the following output results:
error: (2) $error 'c' is not allowed
expecting: 'b'
error: (0) $eof is not allowed
expecting: 'a'
irrecoverable error
The second error message results only when recover()
calls next()
and the end of input tuple is not immediately available.
If the entire input is pushed at once
s = new page.SLR1(page.BNFP.from( Grammar.value )); input = s.tokenizer()( Input.value ).concat(null); s.parse(input)
the second error message disappears.
If the rules
root: 'a' error 'b';
root: 'c';
are added, 'c' is not reported as a $error
tuple by the tokenizer, recover()
reports that input is discarded
error: (2) 'c' is not allowed
expecting: 'b'
discarding (2) 'c'
discarding (2) 'c'
discarding (2) 'c'
and the trace
5 (2) 'b' shift 6 null
6 (0) $eof reduce root: 'a' $error 'b'; [ ]
0 (0) $eof goto 3 null
3 (0) $eof accept [ ]
shows that b
is properly recognized.
There seem to be two conflicting objectives for adding error
tokens to a grammar:
Close to the start symbol of the grammar so that
recover()
has a high chance to find a state expectingerror
low on the stack, i.e., to avoid anyirrecoverable error
.Far away from the start symbol so that
recover()
cannot discard many states beforeerror
is found, i.e., to recover but leave most of the parser state intact.
error
in iterations
Practical experience suggests a different approach: programming language grammars employ iterations at several levels, e.g, for declarations, parameter lists, statements, argument lists, etc. A method to make iterations termination-proof would go a long way towards robust error recovery for programming language grammars. Fortunately, the method exists:
%token Number '[0-9]+';
program: line;
program: program line;
line: 'opt' opt;
line: 'some' some;
line: 'many' many;
opt: ;
opt: Number;
opt: error;
some: Number;
some: some Number;
some: error;
some: some error;
many: ;
many: many Number;
many: error;
many: many error;
This grammar contains a few iterations: an optional item, a sequence of one or more items, and a sequence which may be empty.
For LR(1) parsing the iterations can employ left-recursion.
The error
rules are constructed by placing error
into empty rules and in place of every terminal.
The following input tries to be an exhaustive test:
opt
opt 1
opt ','
some 2
some 3 4
some ,
some 5 ,
many
many 6
many ,
many 7 , 8
There is a shift/reduce conflict in the grammar and there are five parsing errors, but all numbers are found:
state 3 has a shift/reduce conflict: $error and rule (many: ;)
error: (4) $error '\'' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (8) $error ',' is not allowed
expecting: Number
error: (9) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (13) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (14) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
1,2,3,4,5,6,7,8
Similarly, here are the rules for delimited sequences:
some-list: Number; # one or more
some-list: some-list ',' Number;
some-list: error;
some-list: some-list error Number;
some-list: some-list ',' error;
some-list: some-list error;
many-list: ; # zero or more
many-list: some-list;
The following exhaustive input can be sent to either some-list
or many-list
:
some-list
some-list 9
some-list 10 , 11
some-list , 12
some-list 13 14
some-list 15 , , 16
some-list 17 ,
Again, lots of errors, but all numbers are collected!
It is tempting to remove all rules which just have error
on the right-hand side and add an error
one level up:
line: error;
This produces many more shift/reduce conflicts and if some-list
starts out with an error the number 12
will not be collected —
it pays to move the error
symbols as far away from the start symbol as possible.
EBNFP
page.EBNFP represents an extended BNF grammar with precedences, translates the grammar to BNF,
and represents it as a page.BNFP object so that both, BNF and extended BNF, can be used to specify grammars
with precedences for LR(1) parsing.
It makes sense to add error
rules during the translation:
%token Number '[0-9]+';
program: line+;
line: 'opt' opt | 'some' some | 'many' many
| 'some-list' some-list | 'many-list' many-list;
opt: Number?;
some: Number+;
many: Number*;
some-list: Number +/ ',';
many-list: Number */ ',';
This is an EBNF grammar for the same language as before and page.EBNFP.from represents it as an object and translates it into BNF:
g = page.EBNFP.from( Grammar.value ,true);
print( g.bnf )
The flag true
requests that error
rules are inserted, which cause a lot of conflicts when a parser is created:
s = new page.SLR1(g);
input = s.tokenizer()( Input.value ).concat(null);
s.parse(input, s.reflect())
Still, when the same input as before is processed, all numbers are recognized.
It was shown previously that conflicts can be hidden with
precedence levels and %prec
clauses.
Therefore, if a precedence level
%right error;
assigning right-associativity is detected in a grammar, page.EBNFP will do both during translation,
insert error
rules and add appropriate %prec
clauses; the flag mentioned above need not be specified.
Translating EBNF
Translating the various constructs of extended BNF to BNF is not very complicated
but there are subtle pitfalls. An EBNF grammar and the associated BNF grammar share all terminals
and the names of all non-terminals. Additional, invisible names start with $
which is otherwise illegal.
To make semantic actions predictable, the translation must guarantee that a BNF rule with a visible non-terminal name is only reduced once in the course of reducing all parts of the EBNF right-hand side, and offered all collected values in proper sequence. This is arranged for by the default for the observers generated by page.LL1#reflect and page.SLR1#reflect. Invisible non-terminals cannot have semantic actions. If there are none, all collected values are passed on to next outer semantic action.
Four constructs need to be translated: groups, choices, sequences, and delimited lists.
Groups are enclosed by parentheses. The content of a group can be assigned to an invisible non-terminal and the non-terminal replaces the parentheses. The translation is implemented in page.Seq#toBNF.
Choices are delimited by |
within a set. Each choice can be assigned as another right-hand side of an invisible non-terminal
and the non-terminal replaces the complete set of choices. The translation is implemented in page.Alt#toBNF.
Sequences and lists are various forms of iterations and page.EBNF and page.EBNFP use right- and left-recursion, respectively, to handle those so that the results are suitable for page.LL1 and page.SLR1, respectively. The grammar shown above contains all iterative constructs
%token Number '[0-9]+';
line: 'opt' opt | 'some' some | 'many' many
| 'some-list' some-list | 'many-list' many-list;
opt: Number?;
some: Number+;
many: Number*;
some-list: Number +/ ',';
many-list: Number */ ',';
and
page.EBNF.from( Grammar.value ).bnf
shows how each is translated to right-recursive BNF,
page.EBNFP.from( Grammar.value ).bnf
shows how each is translated to left-recursive BNF, and if
%right error;
is added to the grammar
page.EBNFP.from( Grammar.value ).bnf
shows BNF with error
rules and %prec
clauses.
Here are the design patterns:
construct | EBNF | EBNFP | %right error; |
---|---|---|---|
opt: Number?; |
opt: ; opt: Number; |
same | opt: %prec error; opt: Number; |
some: Number+; |
some: Number $#; $#: ; $#: Number $#; |
some: Number $#; $#: ; $#: $# Number; |
some: Number $# %prec error; $#: %prec error; $#: $# Number; $#: $# error; |
many: Number*; |
many: $#; $#: ; $#: Number $#; |
many: $#; $#: ; $#: $# Number; |
many: $# %prec error; $#: %prec error; $#: $# Number; $#: $# error; |
some-list: Number +/ ','; |
some-list: Number $#; $#: ; $#: ',' Number $#; |
some-list: Number $#; $#: ; $#: $# ',' Number; |
some-list: Number $# %prec error; $#: %prec error; $#: $# ',' Number; $#: $# error Number; $#: $# ',' error; $#: $# error; |
many-list: Number */ ','; |
many-list: ; many-list: Number $#; $#: ; $#: ',' Number $#; |
many-list: ; many-list: Number $#; $#: ; $#: $# ',' Number; |
many-list: %prec error; many-list: Number $# %prec error; $#: %prec error; $#: $# ',' Number; $#: $# error Number; $#: $# ',' error; $#: $# error; |
The translation is implemented in page.EBNF.Rep#toBNF for opt
, some
, and many
,
and page.EBNF.Lst#toBNF for some-list
and many-list
using right-recursion,
and in page.EBNFP.Rep#toBNF and page.EBNFP.Lst#toBNF using left-recursion, error
rules and %prec
clauses.
The factories page.EBNF and page.EBNFP ensure that the grammar constructs are represented with the appropriate objects.