Bottom-up parsing

Recognition

page.SLR1 is used just like page.LL1: a grammar is represented as an object using a factory method such as page.BNF.from and wrapped into a new page.SLR1 object to check if the grammar is suitable for LR(1) parsing. If so, page.SLR1#parse takes input and an optional observer function and returns true to request more input, or null, or the last value from the observer.

Input is represented as a sequence of page.Tuple objects, or a function delivering such a sequence, just as described previously. Therefore, input can be pushed to page.SLR1#parse or pulled by the method as described before.

page.trace creates a suitable observer function and page.SLR1#dump will show the current stack. As an example, here is how the grammar

root: 'a';
root: 'a' 'b';
root: root 'c';

can be used to recognize a:

// create parser and tokenizer
s = new page.SLR1(page.BNF.from( Grammar.value ));
t = s.tokenizer();

// observe recognition and display the active stack
s.parse(t('a'), page.trace());

The trace is short:

AT  TUPLE            ACTION                          RETURN
0   (1) 'a'          shift  1                        null

Note that a parser based on LR(1) cannot send a message at the start of a rule because it selects the rule at the very moment it is reduced.

Actually, this explains the names:

LL(1) stands for left-to-right parsing with leftmost derivation and one symbol lookahead, i.e., rules must be selected a priori, top-down.

LR(1) stands for left-to-right parsing with rightmost derivation and one symbol lookahead, i.e., rules are selected as late as possible, only when a configuration is complete, bottom-up.

s.dump() displays the active stack (and more):

active stack:
  0     
  1     

Another push of input with s.parse(t('bcc')) results in more trace output

1   (2) 'b'          shift  3                        null
3   (2) 'c'          reduce root: 'a' 'b';           null
0   (2) 'c'          goto   2                        null
2   (2) 'c'          shift  4                        null
4   (2) 'c'          reduce root: root 'c';          null
0   (2) 'c'          goto   2                        null
2   (2) 'c'          shift  4                        null

and s.dump() shows how the stack has changed:

active stack:
  0     
  2     
  4     

Finally, s.parse(null) concludes parsing and the trace

4   (0) $eof         reduce root: root 'c';          null
0   (0) $eof         goto   2                        null
2   (0) $eof         accept                          null

and returns the result null.

The trace shows that the parser sends shift, reduce, and goto messages to the observer, accompanied by target state or rule information, and finally an accept message. The first column of the trace contains the current state number.

Implementation

page.SLR1 owns the state stack so that input can be pushed. On the first call page.SLR1#parse initializes the stack:

if (!('stack' in this))
  this.stack = [ { state: 0 } ];

For every call the current state object and input tuple are set up:

var state = this.states[ this.stack[this.stack.length - 1].state ];
var current = null;  // current input, none yet available

The parser enters an infinite loop which will be terminated by a need for another input sequence, an accept action, or an Error:

try {
  while (true) {
    // get an input tuple, if needed
    if (!current) next();

    // deal with the input
    ...
  }
} catch (outcome) {
  // request another input sequence?
  if (outcome === true) return true;

  // end parsing: remove stack
  delete this.stack;

  // really bad error?
  if (outcome instanceof Error) throw outcome;

  return outcome[0];  // from the observer, if any
}

page.SLR1#parse has a few local functions for abstraction. try and catch are used so that, e.g., next() can transfer input tuples from an array or from successive function calls, one by one, to current and throw true to request another input sequence.

perform() implements the various actions, communicates with the observer, if any, and returns true if current is consumed. This function throws an array with the last value returned by the observer, if any, once accept is reached.

Finally, recover() implements error recovery and might throw an Error if things go really wrong.

Given these functions dealing with input is easy:

    // deal with the input
    if (current.t != this.bnf.token()         // not $error tuple
        && current.t.ord in state.actions) {  // anticipated input
      if (perform(state.actions[current.t.ord], current))
        next();                               // consumed
    } else
      recover();

An $error tuple represents input which the tokenizer did not understand. It, as well as an input which is not a key in the action table of the current state, will have to be reported by recover before error recovery is attempted.

perform receives the action and the corresponding input and uses another local function observe() to communicate with the observer, if any, and ensure that error messages get printed and counted. An observer can throw a string to be printed as an error message or an Error to abort parsing.

function perform (_action, _tuple) {
  var result = observe(_tuple, _action.action, _action.info);

accept and shift are very simple to implement. accept terminates parsing and returns the result from the observer, if any. A shift action has the new state number as info property.

  switch (_action.action) {
  case 'accept':
    throw [ result ]; // success

  case 'shift':                         // shift to new state
    this.stack.push({ state: _action.info, value: result });
    state = this.states[_action.info];
    return true;                        // _tuple consumed

As discussed before, an observer will collect values and present them to semantic actions. The collection must be performed in parallel to the state changes so that the values collected for a rule can be identified. Therefore, each state change saves the value returned by the observer on the stack.

A reduce action references the rule to be reduced so that the resulting non-terminal and the rule length can be retrieved. reduce is more complicated because it is always followed by a goto action, also with a new state number. The reduce action produces a value from the observer, if any, and during the state change in goto the value is saved on the stack.

  case 'reduce':
    // pop the stack by the length of the rule, uncover state
    this.stack.length -= _action.info.symbols.length;
    state = this.states[this.stack[this.stack.length - 1].state];

    // there has to be a goto for the rule's non-terminal
    var g = state.actions[_action.info.nt.ord];

At this point g must be a goto action which is sent to the observer — mostly for tracing — and the new state and the value returned by the observer, if any, for reduce are pushed onto the stack:

    observe(_tuple, g.action, g.info);  // observe the goto
                                        // goto to new state
    this.stack.push({ state: g.info, value: result });
    state = this.states[g.info];
    return false;                       // _tuple still available
  }
}

Collecting values

page.SLR1#parse has the current parser state on top of a stack. New states are reached through shift and goto actions and pushed onto the stack. For each state the stack holds an associated value which is either null or returned by an observer for the shift message or for the reduce message preceding the goto message, respectively.

An observer is responsible for a semantic action when a rule is reduced. The observer receives a reference to the rule as part of the reduce message. Therefore, the observer can access the exact number of stack entries which contain the associated values.

page.SLR1#yaccpar returns an observer function which is closely patterned after the yacc parser generator. For a shift message this observer will return the string representing the terminal, literal or token, in the input. for a reduce message, without a Builder function, the observer will return the first collected value or null if the rule is empty. As an example, for the grammar

root: 'a';
root: 'a' 'b';
root: root 'c';

and the input

a b c c

the following code

s = new page.SLR1(page.BNF.from( Grammar.value ));
t = s.tokenizer();
o = page.trace(s.yaccpar());
s.parse(t( Input.value ).concat(null), o);

results in this trace:

AT  TUPLE            ACTION                          RETURN
0   (2) 'a'          shift  1                        str a
1   (2) 'b'          shift  3                        str b
3   (2) 'c'          reduce root: 'a' 'b';           str a
0   (2) 'c'          goto   2                        null
2   (2) 'c'          shift  4                        str c
4   (2) 'c'          reduce root: root 'c';          str a
0   (2) 'c'          goto   2                        null
2   (2) 'c'          shift  4                        str c
4   (0) $eof         reduce root: root 'c';          str a
0   (0) $eof         goto   2                        null
2   (0) $eof         accept                          str a
a

I.e., all the literals are collected but page.SLR1#yaccpar retains only the very first one across reduce messages. This changes completely if a semantic action for root is added:

actions = {
  root: function (values) { return '[' + values.join(', ') + ']'; }
};

o = page.trace(s.yaccpar(null, actions), /reduce/);
s.parse(t( Input.value ).concat(null), o);

Now the Builder function accumulates the literals and previous results and builds something close to a syntax tree:

AT  TUPLE            ACTION                          RETURN
3   (2) 'c'          reduce root: 'a' 'b';           str [a, b]
4   (2) 'c'          reduce root: root 'c';          str [[a, b], c]
4   (0) $eof         reduce root: root 'c';          str [[[a, b], c], c]
[[[a, b], c], c]

Semantic actions

page.SLR1#reflect returns an observer which does not collect literal values. For a reduce message, the collected non-null values are concatenated into an array, i.e., null and empty arrays are ignored and other arrays are flattened. If there is no Builder function with the name of the non-terminal for the rule to be reduced, the values are returned by the observer, i.e., they are collected for the next outer reduce message. As an example, the following grammar

%token Name '[a-z]+';

count:  null empty Name double;
null:   ;
empty:  ;
double: Name;

can be used to parse the input

a b

and without semantic actions the observer created by page.SLR1#reflect will collect an array containing a and b. With the actions

actions = {
  'null':   function (values) { return null; },
  'double': function (values) { values.push(values[0]); return values; },
  count:    function (values) { return values.length; }
};

the observer will return the count 3, i.e., the input strings a and two b.

This count would also result if null were collected and the result of double not flattened; however, two push operations in the double action result in a count of 4, disproving this theory.

Adding the action

  empty:    function (values) { return []; },

shows that an empty array or an empty rule does not contribute a value. Inserting a string or a number into the empty array will increase the count.

The grammar is both, LR(1) and LL(1). If the same input is processed using page.LL1#parser and page.LL1#reflect with the same actions, the count remains unaffected by whatever the actions for null and empty return because the parsing algorithm does not enter empty rules:

AT  TUPLE            ACTION                          RETURN
0   (2) Name 'a'     rule   $accept: count $eof;     null
1   (2) Name 'a'     rule   count: null empty Name   null
1   (2) Name 'a'     shift  Name                     null
4   (2) Name 'b'     rule   double: Name;            null
4   (2) Name 'b'     shift  Name                     null
4   (0) $eof         reduce double: Name;            [ ] b,b
1   (0) $eof         reduce count: null empty Name   num 3
0   (0) $eof         shift  $eof                     null
0   (0) $eof         reduce $accept: count $eof;     [ ] 3
3

The trace shows that empty rules are not reduced by the LL(1) parser.

page.SLR1#reflect works just like page.LL1#reflect, i.e., all the examples from Observers can be run by changing from LL(1) to SLR(1).

Error recovery

page.SLR1#parse encounters an error when the current input terminal is not a key in the current state's action table. It was explained above that the local function recover() has to deal with this situation.

recover() first composes a message indicating all input symbols which would be acceptable in the current state:

function recover () {
  var msg = current + ' is not allowed\nexpecting:';
  for (var a in state.actions)
    if (state.actions[a].symbol instanceof page.BNF.T
        && state.actions[a].symbol != this.bnf.token()) // not $error
      msg += ' ' + state.actions[a].symbol;
  observe(current, 'error', msg);

observe() gives the observer, if any, a chance to abort parsing and prints the message otherwise.

Next, recover() discards all input tuples containing the $error token:

  while (current.t == this.bnf.token())
    next();

These tuples are produced by the tokenizer discussed previously for illegal input characters.

At this point next() might have to request more input to be pushed to the parser. This aborts recovery and might result in another error message.

It is time to discard input, states, or both, until input is expected by a state. Like it's namesake in one of the LL(1) implementations this recover(), too, could traverse backwards over the state stack to find anticipated inputs and act on a match. However, in order to give a grammar designer some control over the recovery process, page implements the error token name which yacc also uses.

error is a predefined, reserved token name which can be used in a grammar like any other token name, but it does not match any actual input; in particular, it does not match $error when sent by the tokenizer. Instead, if an actual input is not expected in a current state, recover() discards states until it finds a state with a shift action for error:

  do {
    while (this.bnf.token().ord in state.actions) {
      // have action for error
      if (perform(state.actions[this.bnf.token().ord],
            new page.Tuple(current.lineno, this.bnf.token(), null))) {
        // consumed error tuple, i.e., did shift
        ...
      }
      // else did reduce/goto, loop to check this state for error
    }
    // no action for error, pop stack
    this.stack.pop();
    state = this.states[this.stack[this.stack.length - 1].state];
  } while (this.stack.length > 1);
  throw [ 'irrecoverable error' ];
}

If there is any action for error it is carried out. If it is a reduce action, it is followed by goto and the target state is also checked for an action for error. If all the states are discarded without a shift action for error, recover() aborts parsing.

The shift action for error marks the end of discarding states. recover() will stay in the target state which involves configurations where the marker has made it across error. At this point, recover() will discard input until it is expected in the state:

        // consumed error tuple, i.e., did shift
        while (true) {
          while (!(current.t.ord in state.actions)) {
            this.message('discarding ' + current);
            next();
          }

Once the current input is expected, it's action is performed:

          if (perform(state.actions[current.t.ord], current)) {
            next();  // did shift current
            return;  // ready for another tuple
          }
        }

If perform() returns true, the action was a shift and recover() considers it's mission accomplished. Otherwise there was a reduce action followed by goto to a new state which is at least likely to expect the current input. Therefore, reduce() loops back, checks for the action, and if need be discards some more input.

error

If there is no error symbol in a grammar, page.SLR1#parse will abort parsing as soon as unexpected input is found. As an example, if the grammar

root: 'a' 'b';

is used to parse

a c c c b

the following output results:

error: (2) $error 'c' is not allowed
expecting: 'b'
error: (0) $eof is not allowed
expecting: 'a'
irrecoverable error

The second error message results only when recover() calls next() and the end of input tuple is not immediately available. If the entire input is pushed at once

s = new page.SLR1(page.BNFP.from( Grammar.value )); input = s.tokenizer()( Input.value ).concat(null); s.parse(input)

the second error message disappears.

If the rules

root: 'a' error 'b';
root: 'c';

are added, 'c' is not reported as a $error tuple by the tokenizer, recover() reports that input is discarded

error: (2) 'c' is not allowed
expecting: 'b'
discarding (2) 'c'
discarding (2) 'c'
discarding (2) 'c'

and the trace

5   (2) 'b'          shift  6                        null
6   (0) $eof         reduce root: 'a' $error 'b';    [ ] 
0   (0) $eof         goto   3                        null
3   (0) $eof         accept                          [ ] 

shows that b is properly recognized.

There seem to be two conflicting objectives for adding error tokens to a grammar:

Close to the start symbol of the grammar so that recover() has a high chance to find a state expecting error low on the stack, i.e., to avoid any irrecoverable error.

Far away from the start symbol so that recover() cannot discard many states before error is found, i.e., to recover but leave most of the parser state intact.

error in iterations

Practical experience suggests a different approach: programming language grammars employ iterations at several levels, e.g, for declarations, parameter lists, statements, argument lists, etc. A method to make iterations termination-proof would go a long way towards robust error recovery for programming language grammars. Fortunately, the method exists:

%token Number '[0-9]+';

program: line;
program: program line;

line:    'opt' opt;
line:    'some' some;
line:    'many' many;

opt:     ;
opt:     Number;
opt:     error;

some:    Number;
some:    some Number;
some:    error;
some:    some error;

many:    ;
many:    many Number;
many:    error;
many:    many error;

This grammar contains a few iterations: an optional item, a sequence of one or more items, and a sequence which may be empty. For LR(1) parsing the iterations can employ left-recursion. The error rules are constructed by placing error into empty rules and in place of every terminal. The following input tries to be an exhaustive test:

opt
opt 1
opt ','

some 2
some 3 4
some ,
some 5 ,

many
many 6
many ,
many 7 , 8

There is a shift/reduce conflict in the grammar and there are five parsing errors, but all numbers are found:

state 3 has a shift/reduce conflict: $error and rule (many: ;)
error: (4) $error '\'' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (8) $error ',' is not allowed
expecting: Number
error: (9) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (13) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
error: (14) $error ',' is not allowed
expecting: $eof 'some' 'opt' 'many' Number
1,2,3,4,5,6,7,8

Similarly, here are the rules for delimited sequences:

some-list:   Number;                  # one or more
some-list:   some-list ',' Number;
some-list:   error;
some-list:   some-list error Number;
some-list:   some-list ',' error;
some-list:   some-list error;

many-list:  ;                         # zero or more
many-list:  some-list;

The following exhaustive input can be sent to either some-list or many-list:

some-list
some-list 9
some-list 10 , 11
some-list , 12
some-list 13 14
some-list 15 , , 16
some-list 17 ,

Again, lots of errors, but all numbers are collected!

It is tempting to remove all rules which just have error on the right-hand side and add an error one level up:

line: error;

This produces many more shift/reduce conflicts and if some-list starts out with an error the number 12 will not be collected — it pays to move the error symbols as far away from the start symbol as possible.

EBNFP

page.EBNFP represents an extended BNF grammar with precedences, translates the grammar to BNF, and represents it as a page.BNFP object so that both, BNF and extended BNF, can be used to specify grammars with precedences for LR(1) parsing. It makes sense to add error rules during the translation:

%token Number '[0-9]+';

program:   line+;

line:      'opt' opt | 'some' some | 'many' many
    |      'some-list' some-list | 'many-list' many-list;

opt:       Number?;
some:      Number+;
many:      Number*;
some-list: Number +/ ',';
many-list: Number */ ',';

This is an EBNF grammar for the same language as before and page.EBNFP.from represents it as an object and translates it into BNF:

g = page.EBNFP.from( Grammar.value ,true);
print( g.bnf )

The flag true requests that error rules are inserted, which cause a lot of conflicts when a parser is created:

s = new page.SLR1(g);
input = s.tokenizer()( Input.value ).concat(null);
s.parse(input, s.reflect())

Still, when the same input as before is processed, all numbers are recognized.

It was shown previously that conflicts can be hidden with precedence levels and %prec clauses. Therefore, if a precedence level

%right error;

assigning right-associativity is detected in a grammar, page.EBNFP will do both during translation, insert error rules and add appropriate %prec clauses; the flag mentioned above need not be specified.

Translating EBNF

Translating the various constructs of extended BNF to BNF is not very complicated but there are subtle pitfalls. An EBNF grammar and the associated BNF grammar share all terminals and the names of all non-terminals. Additional, invisible names start with $ which is otherwise illegal.

To make semantic actions predictable, the translation must guarantee that a BNF rule with a visible non-terminal name is only reduced once in the course of reducing all parts of the EBNF right-hand side, and offered all collected values in proper sequence. This is arranged for by the default for the observers generated by page.LL1#reflect and page.SLR1#reflect. Invisible non-terminals cannot have semantic actions. If there are none, all collected values are passed on to next outer semantic action.

Four constructs need to be translated: groups, choices, sequences, and delimited lists.

Groups are enclosed by parentheses. The content of a group can be assigned to an invisible non-terminal and the non-terminal replaces the parentheses. The translation is implemented in page.Seq#toBNF.

Choices are delimited by | within a set. Each choice can be assigned as another right-hand side of an invisible non-terminal and the non-terminal replaces the complete set of choices. The translation is implemented in page.Alt#toBNF.

Sequences and lists are various forms of iterations and page.EBNF and page.EBNFP use right- and left-recursion, respectively, to handle those so that the results are suitable for page.LL1 and page.SLR1, respectively. The grammar shown above contains all iterative constructs

%token Number '[0-9]+';

line:      'opt' opt | 'some' some | 'many' many
    |      'some-list' some-list | 'many-list' many-list;

opt:       Number?;
some:      Number+;
many:      Number*;
some-list: Number +/ ',';
many-list: Number */ ',';

and

page.EBNF.from( Grammar.value ).bnf

shows how each is translated to right-recursive BNF,

page.EBNFP.from( Grammar.value ).bnf

shows how each is translated to left-recursive BNF, and if

%right error;

is added to the grammar

page.EBNFP.from( Grammar.value ).bnf

shows BNF with error rules and %prec clauses. Here are the design patterns:

construct EBNF EBNFP %right error;
opt: Number?; opt: ;
opt: Number;
same opt: %prec error;
opt: Number;
some: Number+; some: Number $#;
$#: ;
$#: Number $#;
some: Number $#;
$#: ;
$#: $# Number;
some: Number $# %prec error;
$#: %prec error;
$#: $# Number;
$#: $# error;
many: Number*; many: $#;
$#: ;
$#: Number $#;
many: $#;
$#: ;
$#: $# Number;
many: $# %prec error;
$#: %prec error;
$#: $# Number;
$#: $# error;
some-list:
  Number +/ ',';
some-list: Number $#;

$#: ;
$#: ',' Number $#;
some-list: Number $#;

$#: ;
$#: $# ',' Number;
some-list:
  Number $# %prec error;
$#: %prec error;
$#: $# ',' Number;
$#: $# error Number;
$#: $# ',' error;
$#: $# error;
many-list:
  Number */ ',';
many-list: ;
many-list: Number $#;

$#: ;
$#: ',' Number $#;
many-list: ;
many-list: Number $#;

$#: ;
$#: $# ',' Number;
many-list: %prec error;
many-list:
  Number $# %prec error;
$#: %prec error;
$#: $# ',' Number;
$#: $# error Number;
$#: $# ',' error;
$#: $# error;

The translation is implemented in page.EBNF.Rep#toBNF for opt, some, and many, and page.EBNF.Lst#toBNF for some-list and many-list using right-recursion, and in page.EBNFP.Rep#toBNF and page.EBNFP.Lst#toBNF using left-recursion, error rules and %prec clauses. The factories page.EBNF and page.EBNFP ensure that the grammar constructs are represented with the appropriate objects.

Previous: LR(1)