Push and pull
Recursive descent requires a LL(1) grammar to decide if a sequence of input symbols is a sentence conforming to the grammar, i.e., given a LL(1) grammar, recursive descent can recognize sentences. There are different strategies to implement the recursive descent algorithm and there are different strategies to supply input.
page.LL1#parser (“parse recursively”) implements recursive descent with a visitor pattern, i.e., a recursive function which traverses the grammar representation, whereas page.LL1#parse implements recursive descent with a stack. Both methods can be called with a function as an argument, which they will call back repeatedly to pull input for analysis. Alternatively, input can be pushed as an argument to either method.
There is a significant difference, however: page.LL1#parser maintains the state of recognition as a nest of recursive calls on the visitor function which cannot be re-entered. Therefore, only the complete input can be pushed to page.LL1#parser at once.
page.LL1#parse saves the state of recognition explicitly as a stack, and input can be pushed piecemeal by calling the method repeatedly to continue recognition with the current state.
The difference between push and pull can be significant if a parser is used in the context of a graphical user interface, e.g., for a calculator. If input can be pushed to recognition, recognition can be an event handler. If, however, input is pulled, recognition must wait until an event provides input: this is likely to require multi-threading and thread communication.
Input
A sequence of input symbols must be presented to either recognition method
as an array of page.Tuple values which can be produced with a
Tokenizer.
null
as an array element indicates end of input.
An input stream can be modeled as a parameterless function which returns successive pieces
of such an array. The function must eventually return null
as an array element,
or null
as a return value, or an empty array.
Such a function can be passed to both recognition methods
to let them pull input. For example:
$ jjs -scripting page.min.js -
jjs> bnf = page.BNF.ctor()
jjs> t = bnf.tokenizer()
jjs> function stream () { return t(readLine('> ')); }
jjs> stream().join('\n')
> %token Number '[0-9]+';
(1) '%token'
(1) Name 'Number'
(1) Literal '\'[0-9]+\''
(1) ';'
jjs is a JavaScript shell and part of a Java installation.
If -scripting
is specified, readLine()
can be used to prompt and read standard input.
readLine()
returns an empty input line as an empty string.
The tokenizer turns this into an empty array:
jjs> stream().length
>
0
The function stream()
defined above would pass standard input on demand to a recognition method:
jjs> l = new LL1(bnf)
jjs> l.parser(stream)
> %token Number '[0-9]+';
> addition: Number;
> addition: Number '+' addition;
>
null
There is no error message, i.e., the input lines together are a BNF sentence because they conform to the grammar for BNF represented with page.BNF.ctor.
If not as a stream, the entire page.Tuple array, with or without a trailing null
element,
must be passed to page.LL1#parser at once:
jjs> l.parser(t( ' addition: ' ))
error: (0) $eof '' is not allowed in "rule: nt ':' rhs ';';"
expecting: ';'
(0) $eof '' irrecoverable error
This is not a BNF sentence because a semicolon is missing.
page.LL1#parse can be called with one or more array elements at a time,
and there has to be a null
element, value, or empty array to indicate completion.
jjs> l.parse(t(' addition: '))
true
jjs> l.parse(t(';'))
true
jjs> l.parse(null)
null
Visitor implementation
page.LL1#parser implements recursive descent as a visitor to the objects representing a grammar.
The ground rule is to only visit an object if the current
input page.Tuple is in first
:
if (this.current.t.ord in grammar.rules[0].first
|| expecting(grammar.rules[0], grammar.rules[0].nt) == -1) {
visit(grammar.rules[0]);
Hopefully, we are off to a good start, i.e., the current input is in first
for $accept
and rule 0.
If so, we visit the object describing a rule. If not, we note an error: $accept
cannot be empty
because it requires $eof
.
Error recovery is managed in the local function expecting()
and is discussed below.
Grammars and BNF and Extended BNF briefly discussed the factory methods to construct the objects
to represent a grammar
and introduced the classes which they belong to. Depending on whether we visit a
BNF or EBNF grammar, we will only encounter one set of classes at a time, but visit()
can be implemented
for all classes at once.
BNF
Here is what happens when visiting a BNF grammar — just three classes have to be considered because the classes page.BNF.Lit for literals and page.BNF.Token for tokens have a common base class page.BNF.T for all terminals:
object | class | visit action |
---|---|---|
rule | page.BNF.Rule | visit the symbols on the right-hand side in order; current input must fit first , empty symbol can be skipped! |
terminal | page.BNF.T | current input is in first , i.e., matches the symbol; therefore, advance in input. |
non-terminal | page.BNF.NT | check current input and first to visit first possible rule for the non-terminal; one of the rules must fit, empty is not an option here! |
It was discussed earlier that this approach leaves no wiggle room if the grammar satisfies LL(1).
EBNF
Visiting an EBNF grammar directly, rather than its BNF counterpart, requires dealing with a few more classes:
construct | class | visit action |
---|---|---|
rule | page.EBNF.Rule | visit the single symbol on the right-hand side; current input must fit first ! |
terminal | page.BNF.T | current input is in first , i.e., matches the symbol; therefore, advance in input. |
non-terminal | page.EBNF.NT | visit the single rule; current input must fit first ! |
a ⎢ b |
page.EBNF.Alt | visit the first possible alternative and return the result; current input and first sets must select just one, empty is not an option when we got here! |
a +/ b a */ b |
page.EBNF.Lst | visit the body and then delimiter and body as often as possible and allowed; current input, first , and empty should select properly. |
a + a * a ? |
page.EBNF.Rep | visit the descendant as often as possible and allowed; current input, first , and empty should select properly. |
(a b) |
page.EBNF.Seq | visit the symbols in the sequence in order; current input must fit first , empty symbol can be skipped! |
Error recovery
What happens if a visitor is confronted with a current input symbol that does not fit
into the upcoming first
set?
The visitor has two choices: discard input and advance until one of the expected symbols is found,
or return from one or more levels of recursion hoping that the current symbol fits into a branch node
which is closer to the root of the syntax tree under construction.
Either approach is risky. Returning from recursion might be the right choice to match an input symbol such as a closing brace, bracket, or parenthesis at the proper grammar level, but a duplicate delimiter such as a comma can result in removing a previously found opening parenthesis from the recursion before the balancing parenthesis shows up in the input.
page.LL1#parser uses the first
and follow
sets to try and make a better guess.
When things start to go wrong, the local function expecting()
is called to report and fix things:
function expecting (context, node, follow) {
var msg = this.current + ' is not allowed in "'+ context + '"\nexpecting:';
var expect = {};
page.merge(expect, node.first);
if (node.empty && node.follow) page.merge(node.follow);
for (var e in expect) msg += ' ' + expect[e];
this.error(msg);
context
represents the grammar rule or EBNF construct where the error is detected.
It is shown in the error message displayed by an error handling method.
node
is the object in the grammar which should be matched.
The error message tries to explain which terminal symbols were expected.
However, it is possible that only some symbols in node.follow
come into play if the node
can accept empty input.
follow
belongs to context
and will be consulted to see if the recursion level should be terminated.
current
is the current input page.Tuple and next()
is a local function
which moves current
on to the next page.Tuple.
while (this.current.t != grammar.lit()) { // don't go past $eof
if (node.follow && this.current.t.ord in node.follow)
return 0; // bypass the node
if (follow && this.current.t.ord in follow)
return 1; // terminate visit
next(); // move on to new input
if (this.current.t.ord in node.first) // new input fits?
return -1; // retry the node!
}
throw new Error(this.current + ' irrecoverable error'); // crash
}
expecting()
returns 0
if the current input can follow node
and 1
if the current input can follow context
, i.e., if the recursion level should be abandoned.
The function returns -1
once it discarded enough input so that node
can be entered.
Example
Here is a grammar for nesting various kinds of parentheses. The grammar is typical for many programming languages:
%token Name '[a-z]+';
%token Number '[0-9]+';
program: braces*;
braces: '{' brackets* '}';
brackets: '[' parentheses */ ';' ']' '.';
parentheses: '(' argument */ ',' ')';
argument: Name | Number;
Something like the following input can be used for experiments:
{
[
( one, 2
] .
}
This produces the following output
error: (5) ']' is not allowed in "'(' argument */ ',' ')'"
expecting: ')'
but the dreaded irrecoverable error
is missing, i.e., expecting()
noticed the closing bracket
in the follow
set for parentheses
(the context
passed to expecting()
) and completed recognition.
Recognition in page can be traced. The output is voluminous and will be discussed later.
At this point, it is interesting to see how terminal symbols are matched (shift
) and how non-terminal symbols are matched (reduce
):
error: (5) ']' is not allowed in "'(' argument */ ',' ')'"
expecting: ')'
4 (5) ']' reduce parentheses: '(' argum obj one,2
4 (5) ']' shift ']' obj null
4 (5) '.' shift '.' obj null
3 (6) '}' reduce brackets: '[' parenthe obj one,2
This excerpt shows that the ]
symbol is in fact matched,
and that the visitor calls for parentheses
(albeit with an error) and brackets
are completed.
For additional experiments one can remove the 2
, the comma, or the closing bracket, etc.
The trace illustrates quite well how follow
helps to recover and fails when there is not enough to look ahead.
The first
and follow
sets can be seen if the grammar is described, i.e., if page.EBNF#dump is used.
Stack implementation
It is the responsibility of each visitor function call to arrange for more visitor calls in order to match a particular grammar construct, e.g., for a page.BNF.Rule the symbols on the right-hand side have to be visited, one after another. Each nested function call is a piece of the state of visiting the grammar as a whole.
As far as visiting rules is concerned, there is no real need for recursive function calls. The current rule and next position on the right-hand side can be stored on a stack instead. page.LL1#parse implements recursive descent with such a stack. The process starts with a stack element pointing to the beginning of rule 0 for a BNF grammar:
if (!('states' in this)) {
this.states = [ null ]; // sentinel
this.top = { rule: this.bnf.rules[0], position: 0 };
}
The method is serially reusable, i.e., it will create and own a stack over several calls and release it once parsing is completed or abandoned. Therefore, a page.LL1 object can only support one parsing operation at a time.
Parsing now happens in a loop:
this.current = null; // no input as yet
loop: while (this.top) { // any active rule?
if (! this.current) next(); // get input as needed
if (this.top.position < this.top.rule.symbols.length) {
var symbol = this.top.rule.symbols[this.top.position];
// try to match the symbol on the right-hand side
} else // completed top rule
this.top = this.states.pop();
}
// sentinel reached, success!
Rule 0 at least contains $eof
. As long as there is a symbol left on the right-hand side of the top rule
it has to match the current input.
Once all symbols are taken care of, the rule is popped off the stack and null
as a sentinel is reached
when rule 0 has been completed, i.e., when a sentence and $eof
have been matched.
Here is how the symbols on the right-hand side are matched. If the symbol is a terminal it has to match the current input tuple:
if (symbol instanceof page.BNF.T) { // terminal?
if (symbol == this.current.t) { // equal to input?
++ this.top.position; // done with symbol
this.current = null; // done with input
continue loop;
}
}
If it does, the stack moves past the symbol in the rule and the parser is done with the current input tuple.
If the grammar symbol is a non-terminal, the current input should be in the first
set:
else if (this.current.t.ord in symbol.first) { // non-terminal, input in first?
if (symbol.rules.some(function (rule) { // input in first of some rule?
if (this.current.t.ord in rule.first) {
++ this.top.position; // done with symbol
this.states.push(this.top); // activate selected rule
this.top = { rule: rule, position: 0 };
return true;
}
return false;
}, this))
continue loop;
}
If the current input is in first
for the non-terminal, LL(1) ensures that the current input
is in first
for exactly one rule of the non-terminal. The stack can move past the
non-terminal symbol, and the relevant rule is pushed on top of the stack.
The loop continues and has to match this rule beginning with the current input.
Alternatively, if the input does not match the non-terminals first
set,
the non-terminal should permit empty
:
} else if (symbol.empty && this.current.t.ord in symbol.follow) { // empty input ok?
++ this.top.position; // done with symbol
continue loop;
}
Theoretically, there is no need to check follow
. However, if the current input cannot follow
the non-terminal it is better not to move on in the rule because error recovery might find
an input that can still match the non-terminal.
var msg = this.current + ' is not allowed in "'+ this.top.rule + '"\nexpecting:';
var expect = {};
for (var p = this.top.position; p < this.top.rule.symbols.length; ++ p) {
var s = this.top.rule.symbols[p];
page.merge(expect, s.first);
if (!s.empty) break;
}
if (p >= this.top.rule.symbols.length) page.merge(expect, this.top.rule.nt.follow);
for (var e in expect) msg += ' ' + expect[e];
this.error(msg);
If there is neither a match for a terminal, nor a match to first
in a non-terminal,
nor an empty
non-terminal, an error message is issued.
As the code shows, the message can include all terminals which can legitimately appear in the input at this point.
With the error reported it is time for the next section.
Error recovery
At this point the current
input does not match the next symbol in the top rule of the state stack.
All outer rules on the state stack are positioned at the next symbol to be matched
after the nested rules are completed. It does not take much code to check
if the current
input fits any upcoming first
set of the top or any outer rule:
function recover (state) {
for (var pos = state.position; pos < state.rule.symbols.length; ++ pos)
if (this.current.t.ord in state.rule.symbols[pos].first) {
state.position = pos;
return true;
}
return false;
}
If recover()
finds that current
fits the first
set of an upcoming symbol in a given state
(stack entry)
it changes the position
to point to that symbol and returns true
.
recover()
is applied to the stack top
and all outer entries:
if (recover(this.top)) continue loop;
for (var s = this.states.length; -- s > 0; )
if (recover(this.states[s])) {
this.top = this.states[s];
this.states = this.states.slice(0, s);
continue loop;
}
If recover()
succeeds somewhere on the stack, the stack is cut back to that level
and parsing continues in an outer rule.
this.message('discarding ' + current);
current = null;
continue loop;
If there is absolutely no match, the current input is discarded and parsing continues with the stack left unchanged. This may draw further error messages.
Comparison
How effective is error recovery? Consider the grammar introduced previously:
%token Name '[a-z]+';
%token Number '[0-9]+';
program: braces*;
braces: '{' brackets* '}';
brackets: '[' parentheses */ ';' ']' '.';
parentheses: '(' argument */ ',' ')';
argument: Name | Number;
The following input draws the same error from either algorithm:
{
[
( one, 2
] .
}
If the closing bracket ]
is removed, the recursive algorithm cannot recover because the period .
is not in a follow
set,
whereas the stack algorithm recovers because recover()
checks every symbol in every outer rule.
If instead the opening bracket [
is removed, the recursive algorithm recovers because at brackets*
in the top rule
braces: '{' brackets* '}';
none of the upcoming symbols fits first
of brackets*
but the closing brace '}' does fit follow
of brackets*
.
The stack algorithm also discards all symbols until the closing brace but it produces a new error message
for each discarded symbol because every new input is run past all active rules.
The extra error messages can be suppressed with a boolean variable throttle
which is set after one error message
and not cleared until an input symbol is actually accepted again.
In summary, both algorithms recover reasonably well within nested structures. The stack algorithm has a small advantage when it comes to recovery within a sequence.