Top-down parsing

Push and pull

Recursive descent requires a LL(1) grammar to decide if a sequence of input symbols is a sentence conforming to the grammar, i.e., given a LL(1) grammar, recursive descent can recognize sentences. There are different strategies to implement the recursive descent algorithm and there are different strategies to supply input.

page.LL1#parser (“parse recursively”) implements recursive descent with a visitor pattern, i.e., a recursive function which traverses the grammar representation, whereas page.LL1#parse implements recursive descent with a stack. Both methods can be called with a function as an argument, which they will call back repeatedly to pull input for analysis. Alternatively, input can be pushed as an argument to either method.

There is a significant difference, however: page.LL1#parser maintains the state of recognition as a nest of recursive calls on the visitor function which cannot be re-entered. Therefore, only the complete input can be pushed to page.LL1#parser at once.

page.LL1#parse saves the state of recognition explicitly as a stack, and input can be pushed piecemeal by calling the method repeatedly to continue recognition with the current state.

The difference between push and pull can be significant if a parser is used in the context of a graphical user interface, e.g., for a calculator. If input can be pushed to recognition, recognition can be an event handler. If, however, input is pulled, recognition must wait until an event provides input: this is likely to require multi-threading and thread communication.

Input

A sequence of input symbols must be presented to either recognition method as an array of page.Tuple values which can be produced with a Tokenizer. null as an array element indicates end of input.

An input stream can be modeled as a parameterless function which returns successive pieces of such an array. The function must eventually return null as an array element, or null as a return value, or an empty array. Such a function can be passed to both recognition methods to let them pull input. For example:

$ jjs -scripting page.min.js -
jjs> bnf = page.BNF.ctor()
jjs> t = bnf.tokenizer()
jjs> function stream () { return t(readLine('> ')); }
jjs> stream().join('\n')
> %token Number '[0-9]+';
(1) '%token'
(1) Name 'Number'
(1) Literal '\'[0-9]+\''
(1) ';'

jjs is a JavaScript shell and part of a Java installation. If -scripting is specified, readLine() can be used to prompt and read standard input. readLine() returns an empty input line as an empty string. The tokenizer turns this into an empty array:

jjs> stream().length
> 
0

The function stream() defined above would pass standard input on demand to a recognition method:

jjs> l = new LL1(bnf)
jjs> l.parser(stream)
> %token Number '[0-9]+';
> addition: Number;
> addition: Number '+' addition;
>
null

There is no error message, i.e., the input lines together are a BNF sentence because they conform to the grammar for BNF represented with page.BNF.ctor.

If not as a stream, the entire page.Tuple array, with or without a trailing null element, must be passed to page.LL1#parser at once:

jjs> l.parser(t( ' addition: ' ))
error: (0) $eof '' is not allowed in "rule: nt ':' rhs ';';"
expecting: ';'
(0) $eof '' irrecoverable error

This is not a BNF sentence because a semicolon is missing.

page.LL1#parse can be called with one or more array elements at a time, and there has to be a null element, value, or empty array to indicate completion.

jjs>  l.parse(t(' addition: '))
true
jjs> l.parse(t(';'))               
true
jjs> l.parse(null)
null

Visitor implementation

page.LL1#parser implements recursive descent as a visitor to the objects representing a grammar. The ground rule is to only visit an object if the current input page.Tuple is in first:

if (this.current.t.ord in grammar.rules[0].first
    || expecting(grammar.rules[0], grammar.rules[0].nt) == -1) {
  visit(grammar.rules[0]);

Hopefully, we are off to a good start, i.e., the current input is in first for $accept and rule 0. If so, we visit the object describing a rule. If not, we note an error: $accept cannot be empty because it requires $eof. Error recovery is managed in the local function expecting() and is discussed below.

Grammars and BNF and Extended BNF briefly discussed the factory methods to construct the objects to represent a grammar and introduced the classes which they belong to. Depending on whether we visit a BNF or EBNF grammar, we will only encounter one set of classes at a time, but visit() can be implemented for all classes at once.

BNF

Here is what happens when visiting a BNF grammar — just three classes have to be considered because the classes page.BNF.Lit for literals and page.BNF.Token for tokens have a common base class page.BNF.T for all terminals:

object class visit action
rule page.BNF.Rule visit the symbols on the right-hand side in order; current input must fit first, empty symbol can be skipped!
terminal page.BNF.T current input is in first, i.e., matches the symbol; therefore, advance in input.
non-terminal page.BNF.NT check current input and first to visit first possible rule for the non-terminal; one of the rules must fit, empty is not an option here!

It was discussed earlier that this approach leaves no wiggle room if the grammar satisfies LL(1).

EBNF

Visiting an EBNF grammar directly, rather than its BNF counterpart, requires dealing with a few more classes:

construct class visit action
rule page.EBNF.Rule visit the single symbol on the right-hand side; current input must fit first!
terminal page.BNF.T current input is in first, i.e., matches the symbol; therefore, advance in input.
non-terminal page.EBNF.NT visit the single rule; current input must fit first!
a ⎢ b page.EBNF.Alt visit the first possible alternative and return the result; current input and first sets must select just one, empty is not an option when we got here!
a +/ b
a */ b
page.EBNF.Lst visit the body and then delimiter and body as often as possible and allowed; current input, first, and empty should select properly.
a +
a *
a ?
page.EBNF.Rep visit the descendant as often as possible and allowed; current input, first, and empty should select properly.
(a b) page.EBNF.Seq visit the symbols in the sequence in order; current input must fit first, empty symbol can be skipped!


Error recovery

What happens if a visitor is confronted with a current input symbol that does not fit into the upcoming first set? The visitor has two choices: discard input and advance until one of the expected symbols is found, or return from one or more levels of recursion hoping that the current symbol fits into a branch node which is closer to the root of the syntax tree under construction.

Either approach is risky. Returning from recursion might be the right choice to match an input symbol such as a closing brace, bracket, or parenthesis at the proper grammar level, but a duplicate delimiter such as a comma can result in removing a previously found opening parenthesis from the recursion before the balancing parenthesis shows up in the input.

page.LL1#parser uses the first and follow sets to try and make a better guess. When things start to go wrong, the local function expecting() is called to report and fix things:

function expecting (context, node, follow) {
  var msg = this.current + ' is not allowed in "'+ context + '"\nexpecting:';
  var expect = {};
  page.merge(expect, node.first);
  if (node.empty && node.follow) page.merge(node.follow);
  for (var e in expect) msg += ' ' + expect[e];
  this.error(msg);

context represents the grammar rule or EBNF construct where the error is detected. It is shown in the error message displayed by an error handling method. node is the object in the grammar which should be matched.

The error message tries to explain which terminal symbols were expected. However, it is possible that only some symbols in node.follow come into play if the node can accept empty input.

follow belongs to context and will be consulted to see if the recursion level should be terminated. current is the current input page.Tuple and next() is a local function which moves current on to the next page.Tuple.

  while (this.current.t != grammar.lit()) {    // don't go past $eof
    if (node.follow && this.current.t.ord in node.follow)
      return 0;                                // bypass the node
    if (follow && this.current.t.ord in follow)
      return 1;                                // terminate visit
    next();                                    // move on to new input
    if (this.current.t.ord in node.first)      // new input fits?
      return -1;                               // retry the node!
  }             
  throw new Error(this.current + ' irrecoverable error');  // crash 
}

expecting() returns 0 if the current input can follow node and 1 if the current input can follow context, i.e., if the recursion level should be abandoned. The function returns -1 once it discarded enough input so that node can be entered.

Example

Here is a grammar for nesting various kinds of parentheses. The grammar is typical for many programming languages:

%token Name '[a-z]+';
%token Number '[0-9]+';

program: braces*;
braces: '{' brackets* '}';
brackets: '[' parentheses */ ';' ']' '.';
parentheses: '(' argument */ ',' ')';
argument: Name | Number;

Something like the following input can be used for experiments:

{
  [
    ( one, 2  
  ] . 
}

This produces the following output

error: (5) ']' is not allowed in "'(' argument */ ',' ')'"
expecting: ')'

but the dreaded irrecoverable error is missing, i.e., expecting() noticed the closing bracket in the follow set for parentheses (the context passed to expecting()) and completed recognition.

Recognition in page can be traced. The output is voluminous and will be discussed later. At this point, it is interesting to see how terminal symbols are matched (shift) and how non-terminal symbols are matched (reduce):

error: (5) ']' is not allowed in "'(' argument */ ',' ')'"
expecting: ')'
4   (5) ']'                 reduce  parentheses: '(' argum  obj one,2
4   (5) ']'                 shift   ']'                     obj null
4   (5) '.'                 shift   '.'                     obj null
3   (6) '}'                 reduce  brackets: '[' parenthe  obj one,2

This excerpt shows that the ] symbol is in fact matched, and that the visitor calls for parentheses (albeit with an error) and brackets are completed.

For additional experiments one can remove the 2, the comma, or the closing bracket, etc. The trace illustrates quite well how follow helps to recover and fails when there is not enough to look ahead. The first and follow sets can be seen if the grammar is described, i.e., if page.EBNF#dump is used.

Stack implementation

It is the responsibility of each visitor function call to arrange for more visitor calls in order to match a particular grammar construct, e.g., for a page.BNF.Rule the symbols on the right-hand side have to be visited, one after another. Each nested function call is a piece of the state of visiting the grammar as a whole.

As far as visiting rules is concerned, there is no real need for recursive function calls. The current rule and next position on the right-hand side can be stored on a stack instead. page.LL1#parse implements recursive descent with such a stack. The process starts with a stack element pointing to the beginning of rule 0 for a BNF grammar:

if (!('states' in this)) {
  this.states = [ null ];  // sentinel
  this.top = { rule: this.bnf.rules[0], position: 0 };
}

The method is serially reusable, i.e., it will create and own a stack over several calls and release it once parsing is completed or abandoned. Therefore, a page.LL1 object can only support one parsing operation at a time.

Parsing now happens in a loop:

this.current = null;           // no input as yet
loop: while (this.top) {       // any active rule?
  if (! this.current) next();  // get input as needed

  if (this.top.position < this.top.rule.symbols.length) {
    var symbol = this.top.rule.symbols[this.top.position];

    // try to match the symbol on the right-hand side

  } else                       // completed top rule
    this.top = this.states.pop();
}
// sentinel reached, success!

Rule 0 at least contains $eof. As long as there is a symbol left on the right-hand side of the top rule it has to match the current input.

Once all symbols are taken care of, the rule is popped off the stack and null as a sentinel is reached when rule 0 has been completed, i.e., when a sentence and $eof have been matched.

Here is how the symbols on the right-hand side are matched. If the symbol is a terminal it has to match the current input tuple:

if (symbol instanceof page.BNF.T) {  // terminal?
  if (symbol == this.current.t) {    // equal to input?
    ++ this.top.position;            // done with symbol
    this.current = null;             // done with input
    continue loop;
  }
}

If it does, the stack moves past the symbol in the rule and the parser is done with the current input tuple.

If the grammar symbol is a non-terminal, the current input should be in the first set:

else if (this.current.t.ord in symbol.first) {  // non-terminal, input in first?
    if (symbol.rules.some(function (rule) {     // input in first of some rule?
          if (this.current.t.ord in rule.first) {
            ++ this.top.position;               // done with symbol
            this.states.push(this.top);         // activate selected rule
            this.top = { rule: rule, position: 0 };
            return true;
          }
          return false;
        }, this))
      continue loop;
}

If the current input is in first for the non-terminal, LL(1) ensures that the current input is in first for exactly one rule of the non-terminal. The stack can move past the non-terminal symbol, and the relevant rule is pushed on top of the stack. The loop continues and has to match this rule beginning with the current input.

Alternatively, if the input does not match the non-terminals first set, the non-terminal should permit empty:

} else if (symbol.empty && this.current.t.ord in symbol.follow) { // empty input ok?
  ++ this.top.position;   // done with symbol
  continue loop;
}

Theoretically, there is no need to check follow. However, if the current input cannot follow the non-terminal it is better not to move on in the rule because error recovery might find an input that can still match the non-terminal.

var msg = this.current + ' is not allowed in "'+ this.top.rule + '"\nexpecting:';
var expect = {};
for (var p = this.top.position; p < this.top.rule.symbols.length; ++ p) {
  var s = this.top.rule.symbols[p];
  page.merge(expect, s.first);
  if (!s.empty) break;
}
if (p >= this.top.rule.symbols.length) page.merge(expect, this.top.rule.nt.follow);
for (var e in expect) msg += ' ' + expect[e];
this.error(msg);

If there is neither a match for a terminal, nor a match to first in a non-terminal, nor an empty non-terminal, an error message is issued. As the code shows, the message can include all terminals which can legitimately appear in the input at this point. With the error reported it is time for the next section.

Error recovery

At this point the current input does not match the next symbol in the top rule of the state stack. All outer rules on the state stack are positioned at the next symbol to be matched after the nested rules are completed. It does not take much code to check if the current input fits any upcoming first set of the top or any outer rule:

function recover (state) {
  for (var pos = state.position; pos < state.rule.symbols.length; ++ pos)
    if (this.current.t.ord in state.rule.symbols[pos].first) {
      state.position = pos;
      return true;
    }
  return false;
}

If recover() finds that current fits the first set of an upcoming symbol in a given state (stack entry) it changes the position to point to that symbol and returns true.

recover() is applied to the stack top and all outer entries:

if (recover(this.top)) continue loop;
for (var s = this.states.length; -- s > 0; )
  if (recover(this.states[s])) {
    this.top = this.states[s];
    this.states = this.states.slice(0, s);
    continue loop;
  }

If recover() succeeds somewhere on the stack, the stack is cut back to that level and parsing continues in an outer rule.

this.message('discarding ' + current);
current = null;
continue loop;

If there is absolutely no match, the current input is discarded and parsing continues with the stack left unchanged. This may draw further error messages.

Comparison

How effective is error recovery? Consider the grammar introduced previously:

%token Name '[a-z]+';
%token Number '[0-9]+';

program: braces*;
braces: '{' brackets* '}';
brackets: '[' parentheses */ ';' ']' '.';
parentheses: '(' argument */ ',' ')';
argument: Name | Number;

The following input draws the same error from either algorithm:

{
  [
    ( one, 2  
  ] . 
}

If the closing bracket ] is removed, the recursive algorithm cannot recover because the period . is not in a follow set, whereas the stack algorithm recovers because recover() checks every symbol in every outer rule.

If instead the opening bracket [ is removed, the recursive algorithm recovers because at brackets* in the top rule

braces: '{' brackets* '}';

none of the upcoming symbols fits first of brackets* but the closing brace '}' does fit follow of brackets*.

The stack algorithm also discards all symbols until the closing brace but it produces a new error message for each discarded symbol because every new input is run past all active rules. The extra error messages can be suppressed with a boolean variable throttle which is set after one error message and not cleared until an input symbol is actually accepted again.

In summary, both algorithms recover reasonably well within nested structures. The stack algorithm has a small advantage when it comes to recovery within a sequence.

Next: Observers Previous: LL(1)