Recognition
page.SLR1 is used just like page.LL1:
a grammar is represented as an object using a factory method such as page.BNF.from
and wrapped into a new page.SLR1 object to check if the grammar is suitable for LR(1) parsing.
If so, page.SLR1#parse takes input and an optional observer function
and returns true to request more input, or null, or the last value from the observer.
Input is represented as a sequence of page.Tuple objects, or a function delivering such a sequence, just as described previously. Therefore, input can be pushed to page.SLR1#parse or pulled by the method as described before.
page.trace creates a suitable observer function and page.SLR1#dump will show the current stack. As an example, here is how the grammar
root: 'a';
root: 'a' 'b';
root: root 'c';can be used to recognize a:
// create parser and tokenizer
s = new page.SLR1(page.BNF.from( Grammar.value ));
t = s.tokenizer();
// observe recognition and display the active stack
s.parse(t('a'), page.trace());
The trace is short:
AT TUPLE ACTION RETURN
0 (1) 'a' shift 1 nullNote that a parser based on LR(1) cannot send a message at the start of a rule because it selects the rule at the very moment it is reduced.
Actually, this explains the names:
LL(1) stands for left-to-right parsing with leftmost derivation and one symbol lookahead, i.e., rules must be selected a priori, top-down.
LR(1) stands for left-to-right parsing with rightmost derivation and one symbol lookahead, i.e., rules are selected as late as possible, only when a configuration is complete, bottom-up.
s.dump() displays the active stack (and more):
active stack:
0
1 Another push of input with s.parse(t('bcc')) results in more trace output
1 (2) 'b' shift 3 null
3 (2) 'c' reduce root: 'a' 'b'; null
0 (2) 'c' goto 2 null
2 (2) 'c' shift 4 null
4 (2) 'c' reduce root: root 'c'; null
0 (2) 'c' goto 2 null
2 (2) 'c' shift 4 nulland s.dump() shows how the stack has changed:
active stack:
0
2
4 Finally, s.parse(null) concludes parsing and the trace
4 (0) $eof reduce root: root 'c'; null
0 (0) $eof goto 2 null
2 (0) $eof accept nulland returns the result null.
The trace shows that the parser sends shift, reduce, and goto messages to the observer,
accompanied by target state or rule information,
and finally an accept message.
The first column of the trace contains the current state number.
Implementation
page.SLR1 owns the state stack so that input can be pushed. On the first call page.SLR1#parse initializes the stack:
if (!('stack' in this))
this.stack = [ { state: 0 } ];For every call the current state object and input tuple are set up:
var state = this.states[ this.stack[this.stack.length - 1].state ];
var current = null; // current input, none yet availableThe parser enters an infinite loop which will be terminated
by a need for another input sequence, an accept action, or an Error:
try {
while (true) {
// get an input tuple, if needed
if (!current) next();
// deal with the input
...
}
} catch (outcome) {
// request another input sequence?
if (outcome === true) return true;
// end parsing: remove stack
delete this.stack;
// really bad error?
if (outcome instanceof Error) throw outcome;
return outcome[0]; // from the observer, if any
}page.SLR1#parse has a few local functions for abstraction.
try and catch are used so that, e.g., next() can transfer input tuples
from an array or from successive function calls, one by one, to current and throw true
to request another input sequence.
perform() implements the various actions, communicates with the observer, if any,
and returns true if current is consumed.
This function throws an array with the last value
returned by the observer, if any, once accept is reached.
Finally, recover() implements error recovery and might throw an Error
if things go really wrong.
Given these functions dealing with input is easy:
// deal with the input
if (current.t != this.bnf.token() // not $error tuple
&& current.t.ord in state.actions) { // anticipated input
if (perform(state.actions[current.t.ord], current))
next(); // consumed
} else
recover();An $error tuple represents input which the tokenizer did not understand.
It, as well as an input which is not a key in the action table of the current state,
will have to be reported by recover before error recovery is attempted.
perform receives the action and the corresponding input and uses another local function observe()
to communicate with the observer, if any, and ensure that error messages get printed and counted.
An observer can throw a string to be printed as an error message or an Error to abort parsing.
function perform (_action, _tuple) {
var result = observe(_tuple, _action.action, _action.info);accept and shift are very simple to implement.
accept terminates parsing and returns the result from the observer, if any.
A shift action has the new state number as info property.
switch (_action.action) {
case 'accept':
throw [ result ]; // success
case 'shift': // shift to new state
this.stack.push({ state: _action.info, value: result });
state = this.states[_action.info];
return true; // _tuple consumedAs discussed before, an observer will collect values and present them to semantic actions. The collection must be performed in parallel to the state changes so that the values collected for a rule can be identified. Therefore, each state change saves the value returned by the observer on the stack.
A reduce action references the rule to be reduced so that the resulting non-terminal and the rule length can be retrieved.
reduce is more complicated because it is always followed by a goto action, also with a new state number.
The reduce action produces a value from the observer, if any,
and during the state change in goto the value is saved on the stack.
case 'reduce':
// pop the stack by the length of the rule, uncover state
this.stack.length -= _action.info.symbols.length;
state = this.states[this.stack[this.stack.length - 1].state];
// there has to be a goto for the rule's non-terminal
var g = state.actions[_action.info.nt.ord];At this point g must be a goto action which is sent to the observer — mostly for tracing — and the new state
and the value returned by the observer, if any, for reduce are pushed onto the stack:
observe(_tuple, g.action, g.info); // observe the goto
// goto to new state
this.stack.push({ state: g.info, value: result });
state = this.states[g.info];
return false; // _tuple still available
}
}Collecting values
page.SLR1#parse has the current parser state on top of a stack.
New states are reached through shift and goto actions and pushed onto the stack.
For each state the stack holds an associated value which is either null or
returned by an observer for the shift message or for the reduce message preceding the goto message, respectively.
An observer is responsible for a semantic action when a rule is reduced.
The observer receives a reference to the rule as part of the reduce message.
Therefore, the observer can access the exact number of stack entries which contain the associated values.
page.SLR1#yaccpar returns an observer function which is closely patterned after the
yacc parser generator.
For a shift message this observer will return the string representing the terminal, literal or token, in the input.
for a reduce message, without a Builder function, the observer will return the first collected value
or null if the rule is empty. As an example, for the grammar
root: 'a';
root: 'a' 'b';
root: root 'c';and the input
a b c cthe following code
s = new page.SLR1(page.BNF.from( Grammar.value ));
t = s.tokenizer();
o = page.trace(s.yaccpar());
s.parse(t( Input.value ).concat(null), o);
results in this trace:
AT TUPLE ACTION RETURN
0 (2) 'a' shift 1 str a
1 (2) 'b' shift 3 str b
3 (2) 'c' reduce root: 'a' 'b'; str a
0 (2) 'c' goto 2 null
2 (2) 'c' shift 4 str c
4 (2) 'c' reduce root: root 'c'; str a
0 (2) 'c' goto 2 null
2 (2) 'c' shift 4 str c
4 (0) $eof reduce root: root 'c'; str a
0 (0) $eof goto 2 null
2 (0) $eof accept str a
aI.e., all the literals are collected but page.SLR1#yaccpar retains only the very first one across reduce messages.
This changes completely if a semantic action for root is added:
actions = {
root: function (values) { return '[' + values.join(', ') + ']'; }
};
o = page.trace(s.yaccpar(null, actions), /reduce/);
s.parse(t( Input.value ).concat(null), o);Now the Builder function accumulates the literals and previous results and builds something close to a syntax tree:
AT TUPLE ACTION RETURN
3 (2) 'c' reduce root: 'a' 'b'; str [a, b]
4 (2) 'c' reduce root: root 'c'; str [[a, b], c]
4 (0) $eof reduce root: root 'c'; str [[[a, b], c], c]
[[[a, b], c], c]Semantic actions
page.SLR1#reflect returns an observer which does not collect literal values.
For a reduce message, the collected non-null values are concatenated into an array,
i.e., null and empty arrays are ignored and other arrays are flattened.
If there is no Builder function with the name of the non-terminal for the rule to be reduced,
the values are returned by the observer, i.e., they are collected for the next outer reduce message.
As an example, the following grammar
%token Name '[a-z]+';
count: null empty Name double;
null: ;
empty: ;
double: Name;can be used to parse the input
a b
and without semantic actions the observer created by page.SLR1#reflect will collect an array containing a and b.
With the actions
actions = {
'null': function (values) { return null; },
'double': function (values) { values.push(values[0]); return values; },
count: function (values) { return values.length; }
};the observer will return the count 3, i.e., the input strings a and two b.
This count would also result if null were collected and the result of double not flattened;
however, two push operations in the double action result in a count of 4, disproving this theory.
Adding the action
empty: function (values) { return []; },shows that an empty array or an empty rule does not contribute a value. Inserting a string or a number into the empty array will increase the count.
The grammar is both, LR(1) and LL(1).
If the same input is processed using
page.LL1#parser and page.LL1#reflect with the same actions,
the count remains unaffected by whatever the actions for null and empty return
because the parsing algorithm does not enter empty rules:
AT TUPLE ACTION RETURN
0 (2) Name 'a' rule $accept: count $eof; null
1 (2) Name 'a' rule count: null empty Name null
1 (2) Name 'a' shift Name null
4 (2) Name 'b' rule double: Name; null
4 (2) Name 'b' shift Name null
4 (0) $eof reduce double: Name; [ ] b,b
1 (0) $eof reduce count: null empty Name num 3
0 (0) $eof shift $eof null
0 (0) $eof reduce $accept: count $eof; [ ] 3
3The trace shows that empty rules are not reduced by the LL(1) parser.
page.SLR1#reflect works just like page.LL1#reflect, i.e., all the examples from Observers can be run by changing from LL(1) to SLR(1).
Error recovery
page.SLR1#parse encounters an error when the current input terminal is not a key in the current state's action table.
It was explained above that the local function recover() has to deal with this situation.
recover() first composes a message indicating all input symbols which would be acceptable in the current state:
function recover () {
var msg = current + ' is not allowed\nexpecting:';
for (var a in state.actions)
if (state.actions[a].symbol instanceof page.BNF.T
&& state.actions[a].symbol != this.bnf.token()) // not $error
msg += ' ' + state.actions[a].symbol;
observe(current, 'error', msg);observe() gives the observer, if any, a chance to abort parsing and prints the message otherwise.
Next, recover() discards all input tuples containing the $error token:
while (current.t == this.bnf.token())
next();These tuples are produced by the tokenizer discussed previously for illegal input characters.
At this point next() might have to request more input to be pushed to the parser.
This aborts recovery and might result in another error message.
It is time to discard input, states, or both, until input is expected by a state.
Like it's namesake in one of the LL(1) implementations
this recover(), too, could traverse backwards over the state stack to find anticipated inputs and act on a match.
However, in order to give a grammar designer some control over the recovery process,
page implements the error token name which yacc also uses.
error is a predefined, reserved token name which can be used in a grammar like any other token name,
but it does not match any actual input; in particular, it does not match $error when sent by the tokenizer.
Instead, if an actual input is not expected in a current state,
recover() discards states until it finds a state with a shift action for error:
do {
while (this.bnf.token().ord in state.actions) {
// have action for error
if (perform(state.actions[this.bnf.token().ord],
new page.Tuple(current.lineno, this.bnf.token(), null))) {
// consumed error tuple, i.e., did shift
...
}
// else did reduce/goto, loop to check this state for error
}
// no action for error, pop stack
this.stack.pop();
state = this.states[this.stack[this.stack.length - 1].state];
} while (this.stack.length > 1);
throw [ 'irrecoverable error' ];
}If there is any action for error it is carried out.
If it is a reduce action, it is followed by goto and the target state is also checked for an action for error.
If all the states are discarded without a shift action for error,
recover() aborts parsing.
The shift action for error marks the end of discarding states.
recover() will stay in the target state which involves configurations where the marker has made it across error.
At this point, recover() will discard input until it is expected in the state:
// consumed error tuple, i.e., did shift
while (true) {
while (!(current.t.ord in state.actions)) {
this.message('discarding ' + current);
next();
}Once the current input is expected, it's action is performed:
if (perform(state.actions[current.t.ord], current)) {
next(); // did shift current
return; // ready for another tuple
}
}If perform() returns true, the action was a shift and recover() considers it's mission accomplished.
Otherwise there was a reduce action followed by goto to a new state which is at least likely to expect the current input.
Therefore, reduce() loops back, checks for the action, and if need be discards some more input.
error
If there is no error symbol in a grammar, page.SLR1#parse will abort parsing
as soon as unexpected input is found. As an example, if the grammar
root: 'a' 'b';is used to parse
a c c c b
the following output results:
error: (2) $error 'c' is not allowed
expecting: 'b'
error: (0) $eof is not allowed
expecting: 'a'
irrecoverable errorThe second error message results only when recover() calls next() and the end of input tuple is not immediately available.
If the entire input is pushed at once
s = new page.SLR1(page.BNFP.from( Grammar.value )); input = s.tokenizer()( Input.value ).concat(null); s.parse(input)
the second error message disappears.
If the rules
root: 'a' error 'b';
root: 'c';are added, 'c' is not reported as a $error tuple by the tokenizer, recover() reports that input is discarded
error: (2) 'c' is not allowed
expecting: 'b'
discarding (2) 'c'
discarding (2) 'c'
discarding (2) 'c'and the trace
5 (2) 'b' shift 6 null
6 (0) $eof reduce root: 'a' $error 'b'; [ ]
0 (0) $eof goto 3 null
3 (0) $eof accept [ ] shows that b is properly recognized.
There seem to be two conflicting objectives for adding error tokens to a grammar:
Close to the start symbol of the grammar so that
recover()has a high chance to find a state expectingerrorlow on the stack, i.e., to avoid anyirrecoverable error.Far away from the start symbol so that
recover()cannot discard many states beforeerroris found, i.e., to recover but leave most of the parser state intact.
error in iterations
Practical experience suggests a different approach: programming language grammars employ iterations at several levels, e.g, for declarations, parameter lists, statements, argument lists, etc. A method to make iterations termination-proof would go a long way towards robust error recovery for programming language grammars. Fortunately, the method exists:
%token Number '[0-9]+';
program: line;
program: program line;
line: 'opt' opt;
line: 'some' some;
line: 'many' many;
opt: ;
opt: Number;
opt: error;
some: Number;
some: some Number;
some: error;
some: some error;
many: ;
many: many Number;
many: error;
many: many error;
This grammar contains a few iterations: an optional item, a sequence of one or more items, and a sequence which may be empty.
For LR(1) parsing the iterations can employ left-recursion.
The error rules are constructed by placing error into empty rules and in place of every terminal.
The following input tries to be an exhaustive test:
opt
opt 1
opt ','
some 2
some 3 4
some ,
some 5 ,
many
many 6
many ,
many 7 , 8There is a shift/reduce conflict in the grammar and there are five parsing errors, but all numbers are found:
state 3 has a shift/reduce conflict: $error and rule (many: ;)
error: (4) $error '\'' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (8) $error ',' is not allowed
expecting: Number
error: (9) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (13) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (14) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
1,2,3,4,5,6,7,8Similarly, here are the rules for delimited sequences:
some-list: Number; # one or more
some-list: some-list ',' Number;
some-list: error;
some-list: some-list error Number;
some-list: some-list ',' error;
some-list: some-list error;
many-list: ; # zero or more
many-list: some-list;The following exhaustive input can be sent to either some-list or many-list:
some-list
some-list 9
some-list 10 , 11
some-list , 12
some-list 13 14
some-list 15 , , 16
some-list 17 ,Again, lots of errors, but all numbers are collected!
It is tempting to remove all rules which just have error on the right-hand side and add an error one level up:
line: error;This produces many more shift/reduce conflicts and if some-list starts out with an error the number 12 will not be collected —
it pays to move the error symbols as far away from the start symbol as possible.
EBNFP
page.EBNFP represents an extended BNF grammar with precedences, translates the grammar to BNF,
and represents it as a page.BNFP object so that both, BNF and extended BNF, can be used to specify grammars
with precedences for LR(1) parsing.
It makes sense to add error rules during the translation:
%token Number '[0-9]+';
program: line+;
line: 'opt' opt | 'some' some | 'many' many
| 'some-list' some-list | 'many-list' many-list;
opt: Number?;
some: Number+;
many: Number*;
some-list: Number +/ ',';
many-list: Number */ ',';
This is an EBNF grammar for the same language as before and page.EBNFP.from represents it as an object and translates it into BNF:
g = page.EBNFP.from( Grammar.value ,true);
print( g.bnf )The flag true requests that error rules are inserted, which cause a lot of conflicts when a parser is created:
s = new page.SLR1(g);
input = s.tokenizer()( Input.value ).concat(null);
s.parse(input, s.reflect())Still, when the same input as before is processed, all numbers are recognized.
It was shown previously that conflicts can be hidden with
precedence levels and %prec clauses.
Therefore, if a precedence level
%right error;assigning right-associativity is detected in a grammar, page.EBNFP will do both during translation,
insert error rules and add appropriate %prec clauses; the flag mentioned above need not be specified.
Translating EBNF
Translating the various constructs of extended BNF to BNF is not very complicated
but there are subtle pitfalls. An EBNF grammar and the associated BNF grammar share all terminals
and the names of all non-terminals. Additional, invisible names start with $ which is otherwise illegal.
To make semantic actions predictable, the translation must guarantee that a BNF rule with a visible non-terminal name is only reduced once in the course of reducing all parts of the EBNF right-hand side, and offered all collected values in proper sequence. This is arranged for by the default for the observers generated by page.LL1#reflect and page.SLR1#reflect. Invisible non-terminals cannot have semantic actions. If there are none, all collected values are passed on to next outer semantic action.
Four constructs need to be translated: groups, choices, sequences, and delimited lists.
Groups are enclosed by parentheses. The content of a group can be assigned to an invisible non-terminal and the non-terminal replaces the parentheses. The translation is implemented in page.Seq#toBNF.
Choices are delimited by | within a set. Each choice can be assigned as another right-hand side of an invisible non-terminal
and the non-terminal replaces the complete set of choices. The translation is implemented in page.Alt#toBNF.
Sequences and lists are various forms of iterations and page.EBNF and page.EBNFP use right- and left-recursion, respectively, to handle those so that the results are suitable for page.LL1 and page.SLR1, respectively. The grammar shown above contains all iterative constructs
%token Number '[0-9]+';
line: 'opt' opt | 'some' some | 'many' many
| 'some-list' some-list | 'many-list' many-list;
opt: Number?;
some: Number+;
many: Number*;
some-list: Number +/ ',';
many-list: Number */ ',';
and
page.EBNF.from( Grammar.value ).bnfshows how each is translated to right-recursive BNF,
page.EBNFP.from( Grammar.value ).bnfshows how each is translated to left-recursive BNF, and if
%right error;is added to the grammar
page.EBNFP.from( Grammar.value ).bnfshows BNF with error rules and %prec clauses.
Here are the design patterns:
| construct | EBNF | EBNFP | %right error; |
|---|---|---|---|
opt: Number?; |
opt: ;opt: Number; |
same | opt: %prec error;opt: Number; |
some: Number+; |
some: Number $#;$#: ;$#: Number $#; |
some: Number $#;$#: ;$#: $# Number; |
some: Number $# %prec error;$#: %prec error;$#: $# Number;$#: $# error; |
many: Number*; |
many: $#;$#: ;$#: Number $#; |
many: $#;$#: ;$#: $# Number; |
many: $# %prec error;$#: %prec error;$#: $# Number;$#: $# error; |
some-list:Number +/ ','; |
some-list: Number $#;$#: ;$#: ',' Number $#; |
some-list: Number $#;$#: ;$#: $# ',' Number; |
some-list:Number $# %prec error;$#: %prec error;$#: $# ',' Number;$#: $# error Number;$#: $# ',' error;$#: $# error; |
many-list:Number */ ','; |
many-list: ;many-list: Number $#;$#: ;$#: ',' Number $#; |
many-list: ;many-list: Number $#;$#: ;$#: $# ',' Number; |
many-list: %prec error;many-list:Number $# %prec error;$#: %prec error;$#: $# ',' Number;$#: $# error Number;$#: $# ',' error;$#: $# error; |
The translation is implemented in page.EBNF.Rep#toBNF for opt, some, and many,
and page.EBNF.Lst#toBNF for some-list and many-list using right-recursion,
and in page.EBNFP.Rep#toBNF and page.EBNFP.Lst#toBNF using left-recursion, error rules and %prec clauses.
The factories page.EBNF and page.EBNFP ensure that the grammar constructs are represented with the appropriate objects.
page