LR(1)

The idea

The following grammar describes sequences starting with a single a, optionally followed by a single b, and ending in any number of c:

root: 'a';
root: 'a' 'b';
root: root 'c';

The grammar is left-recursive and the first two rules for root start with the same terminal symbol, i.e., the grammar is definitely not LL(1). However, it is not ambiguous because a sentence starts either with a or with a and b and that uniquely determines how each syntax tree is built.

If recognition is based on the grammar how should it proceed? For LR(1), rather than seleting one rule as in LL(1), a position marker is placed at the beginning of the rules and advanced as input arrives:

transition diagram

The result is a directed graph where the nodes are recognition states which contain marked rules and the edges indicate transitions based on input terminals or recognized non-terminals.

A non-terminal is recognized exactly when a rule can be satisfied, i.e., the graph needs edges for completed rules. A completed rule has the marker at the end of the rule. Edges must lead from states with completed rules to states with rules where the marker is just before a non-terminal corresponding to the completed rule:

transition diagram

In the diagram above these edges are labeled reduce and are marked with double arrowheads. A transition based on a terminal is called a shift and a transition based on a non-terminal is called a goto. A reduce transition is always immediately followed by a goto transition.

The diagram above is not quite complete because rule 0

$accept: root $eof

still needs to be pictured:

transition diagram

This provides a unique start state 0 for recognition, namely the state which contains rule 0 marked at the beginning. It also provides a unique state where recognition ends successfully, namely the state which contains rule 0 marked at the end, following $eof.

One more thought and the recognition technique is complete: The diagram based on states composed of marked rules is finite because the number and length of rules, both, are finite. However, a grammar can describe a set of nested parentheses of arbitrary depth, i.e., recognition needs a mechanism for unlimited counting which a finite diagram cannot provide.

Therefore, the states are kept on a state stack with the current state on top. Each shift and goto transition, i.e., recognizing an input terminal or a completed non-terminal, pushes one new state on top of the stack. A reduce transition pops states off the stack, namely exactly as many states as it took to complete the rule which is reduced, i.e., exactly as many states as there are symbols in the right-hand side of the rule. This can be counted out in the diagram.

The architecture

page uses similar class architectures for LL(1) and LR(1) parsing. A page.SLR1 object wraps a grammar for parsing. If the grammar is represented as a page.EBNF object the corresponding page.BNF object will be used because only BNF rules can be marked with a single marker per rule.

The state table is computed during construction and used during parsing. The state stack is held by the object, i.e., there can only be one parsing operation at a time.

The parsing method page.SLR1#parse uses the same parameters as page.LL1#parse, i.e., input can be a sequence of page.Tuple objects or a function providing such sequences. Because the parsing state is stacked, successive input sequences can be pushed to the parsing method.

Recognition can be observed with page.trace, although LR(1) parsing sends slightly different messages. Semantic actions again use the Builder interface and are selected by an observer created with page.SLR1#reflect.

The classic LR(1) implementations yacc and bison spend considerable effort on compacting the state table. page is more tutorial in nature; therefore, the state table is an array of page.SLR1.State objects. Each holds an array of one or more page.SLR1.Config objects, i.e., the marked rules, and an actions collection which maps the symbols for the outgoing edges of the state to the possible actions shift, goto, reduce, and accept. For display purposes these actions are encoded as page.SLR1.Action objects.

Constructing the state table

When a page.SLR1 object is constructed, the first page.SLR1.State object is created with rule 0 marked at the beginning:

$accept: ● start-symbol $eof;

The page.SLR1 object has factory methods page.SLR1#config and page.SLR1#state for page.SLR1.Config and page.SLR1.State objects, and it contains the state table:

this.states.push(this.state([ this.config(this.bnf.rules[0], 0) ]));

A state is created with one or more marked rules. Together they are called the core configurations and they uniquely define the state. The state additionally contains the closure of the core: all rules for all non-terminals which immediately follow a marker. Here is the factory method:

page.SLR1.prototype.state = function (_core) {
  var coreLength = _core.length;
  var actions = {};
  // compute closure: loop over core and added configurations
  ...
  return new page.SLR1.State(_core, coreLength, actions);
};

The state stores the number of configurations in the core because only cores have to be compared to determine if two states are equal. The actions collection will contain all symbols which immediately follow a marker. This is set up when the rules in the closure are added to _core:

// compute closure: loop over core and added configurations
for (var c = 0; c < _core.length; ++ c)
  // for each configuration where the marker is not at the end
  if (!_core[c].complete) {
    // next symbol in the configuration
    var s = _core[c].rule.symbols[_core[c].position];
    // if it is a non-terminal and not seen before
    if (s instanceof page.BNF.NT && !(s.ord in actions))
      // add all rules for a new non-terminal, marked at 0
      s.rules.forEach(function (rule) {
        _core.push(this.config(rule, 0));
      }, this);
    // add this next terminal or non-terminal
    actions[s.ord] = null; // no action yet
  }

This loop runs as long as new marked rules are added — until it reaches the eventual end of _core.

For a configuration .complete is true if the marker follows all symbols.

In a similar loop, all state objects are then asked to page.SLR1.State#advance, i.e., to create outgoing edges and more states as needed to terminate the edges, and to fill in the actions table:

for (var s = 0; s < this.states.length; ++ s)
  this.states[s].advance(this, s);

The heavy lifting happens in the advance method:

page.SLR1.State.prototype.advance = function (_factory, _stateNumber) {
  // create reduce actions for complete configurations
  this.configs.forEach(function (config) {
    if (config.complete) {
      ...
    }
  }, this);

The first step is to enter reduce actions into the actions table for all complete configurations. The details are discussed in the following sections.

Each literal, token, and non-terminal has a unique ordinal number, starting with 0. actions maps this number to an action. Once the reduce actions have been entered, all other entries in actions will have to be shift or goto, with one exception:

  // create accept/goto/shift actions for each next symbol which has none
  for (var a in this.actions)
    if (this.actions[a] == null)
      if (a == _factory.bnf.lit().ord) {
        // special case: $eof
        this.actions[a] = _factory.accept();
        _factory.bnf.rules[0].reduced = true;

$eof only appears in rule 0 and this symbol can only lead to a reduce or accept action. accept means that rule 0 is successfully recognized.

If the symbol is not $eof, the action will be shift or goto and needs a target state which is computed by advancing the marker across the symbol in all configurations where the marker is currently just before the symbol:

      } else {
        // create next core by advancing marker over one symbol
        var next = [ ];
        var symbol = null;
        this.configs.forEach(function (config) {
          // find a as next symbol in all configs
          if (!config.complete && a == config.rule.symbols[config.position].ord) {
            // remember symbol and push config with marker after symbol
            symbol = config.rule.symbols[config.position];
            next.push(config.advance(_factory));
          }
        });

next is used to collect all new configurations with the marker moved across the symbol. page.SLR1.Config#advance uses page.SLR1#config to create a new configuration with the same rule but the marker in the next position. next contains at least one configuration, otherwise a would not have been a key in actions for this state.

next is the core of the target state for the shift or goto action to be recorded for the key a in actions. If the core cannot be found among the known states, a new state needs to be created:

        // add new state with next as core, if any
        // shift/goto existent or new state
        if (!_factory.states.some(
              function (state, s) {
                // compare new core and existing state
                if (!state.equals(next)) return false;
                // found: create action
                this.actions[a] = _factory.shift_or_goto(symbol, s);
                return true;
              }, this)) {
          // not found: create action and new state
          this.actions[a] = _factory.shift_or_goto(symbol, _factory.states.length);
          _factory.states.push(_factory.state(next));
        }
      }
};

This concludes creating all but the reduce actions in page.SLR1.State#advance, and it looks as if nothing can go wrong. Unfortunately, the reduce actions show that there can be problems — not every grammar is suitable for LR(1) parsing.

Conflicts

The previous section did not discuss how reduce actions are entered into a state's action table before the other actions are filled in. A reduce action requires a complete configuration and it only makes sense to enter them for terminals which can follow the configuration's rule's non-terminal:

  // create reduce actions for complete configurations
  this.configs.forEach(function (config) {
    if (config.complete) {
      var rule = config.rule;          // rule we are in

      // for each terminal which can follow the rule in the grammar
      for (var t in rule.nt.follow) {  // ordinal number
        var f = rule.nt.follow[t];     // terminal which can follow

        if (!(t in this.actions)) {    // can it follow in this state?
          // if t is not in actions it cannot follow this state -> reduce
          rule.reduced = true;
          this.actions[t] = _factory.reduce(f, rule);
        }

If such a terminal is not yet a key in the state's action table, it cannot lead from the state to another one; therefore, this terminal needs to be added with a reduce action.

However, if the terminal is already a key in the action table, there is a conflict. If the terminal was associated with a reduce action previously, the terminal is in the follow set for some other complete configuration, i.e., there are two different rules which can be reduced before the terminal is accepted:

        else if (this.actions[t] != null) {
          // t is in actions and actions[t] is already set as a reduce
          var r2 = this.actions[t].info; // the other rule

          ++ _factory.rr;
          this.message('state', _stateNumber, 'for', f,
            'reduce/reduce conflict between', '(' + rule + ')', 'and', '(' + r2 + ')');

          // resolve for rule which is first in the grammar
          if (rule.index < r2.index) this.actions[t].info = rule;
        }

Technically, this reduce/reduce conflict makes the grammar unsuitable for LR(1) parsing because the state table is not unique. Traditionally, as a stopgap measure in such a case, the first rule in the grammar is selected to be reduced.

If there is no prior reduce action, the terminal is in the action table because it will be associated with a shift action when the marker is moved across it, as discussed in the previous section:

        else { // this.actions[t] == null
          ++ _factory.sr;
          this.message('state', _stateNumber, 'has a shift/reduce conflict:',
            f, 'and rule', '(' + rule + ')');
        } // done with conflicts
      } // done with every t which can follow
    } // done with every complete configuration
  }, this);

This is called a shift/reduce conflict and makes the grammar unsuitable for LR(1) parsing, too. In this case the stopgap measure is to prefer the shift action.

A shift/reduce conflict

Prefering the shift action is often desirable, for example to resolve the classic dangling else problem in favor of the innermost if statement:

%token Number '[0-9]+';

statement: 'if' Number statement;
statement: 'if' Number statement 'else' statement;
statement: Number;

The grammar above has a shift/reduce conflict for else but a trace of parsing

if 1
  if 2 3 else 4

shows that the inner if statement is reduced with else:

2   (3) 'else'       reduce statement: Number;       [ ] 3
2   (0) $eof         reduce statement: Number;       [ ] 4
7   (0) $eof         reduce statement: 'if' Number   [ ] 2,3,4
5   (0) $eof         reduce statement: 'if' Number   [ ] 1,2,3,4

A reduce/reduce conflict

The built-in resolution for a shift/reduce conflict can often be accepted but reduce/reduce conflicts are usually more serious and need to be investigated carefully. Here is a grammar, hugely abstracted, where an expression is based on a condition or arithmetic and where a condition is based on a comparison or arithmetic:

%token Number '[0-9]+';

expression: condition;
expression: Number;

condition: sum '<' sum;
sum: Number;

This grammar has a reduce/reduce conflict between the second and the last rule which is resolved in favor of the second rule. Tracing shows that a single Number such as 0 can be recognized. However, if the last rule is moved into second place the conflict is resolved in favor of that rule and now a 0 cannot be recognized!

“Real” LR(1)

As a final note, the implementation discussed here is a slight simplification of LR(1) and is, therefore, called SLR(1). Using the follow set of the rule's non-terminal when a complete configuration is processed is convenient but not as restrictive as possible. Each configuration starts with a nonterminal in a known context and one could consider only those terminals which can follow in that context, i.e., each configuration should include it's own follow set. This results in larger state tables but potentially fewer conflicts and, therefore, more grammars which are suitable for LR(1) parsing.

Precedence

The LL(1) grammar for interpreting arithmetic expressions was a bit cumbersome and operator associativity had to be implemented in the semantic actions, not in the grammar itself. The following grammar is more natural:

%token Number '[0-9]+';

sum: add | sub | prod;
add: sum '+' prod;
sub: sum '-' prod;

prod: mul | div | factor;
mul: prod '*' factor;
div: prod '/' factor;

factor: pow | prim;
pow: prim '**' factor;

prim: num | '(' sum ')';
num: Number;

page translates this grammar to BNF so that rules can be properly marked. Irrespective of the translation, the rules for most operators are left-recursive, corresponding to the left-associativity of the operators; only exponentiation is right-associative and expressed with a right-recursive rule. As in the LL(1) example, operator precedence is encoded into the grammar because the non-terminals sum, prod, factor, and prim establish appropriate levels.

add, sub, etc. are hooks for semantic actions and a typical action such as

add: function (values) { return values[0] + values[1]; }

does not have to consider associativity.

On the other hand, the grammar

%token Number '[0-9]+';

expr: add | sub | mul | div | pow | '(' expr ')' | num;
add: expr '+' expr;
sub: expr '-' expr;
mul: expr '*' expr;
div: expr '/' expr;
pow: expr '**' expr;
num: Number;

is even shorter but it lacks any information about operator precedence and associativity and page notes 25 shift/reduce conflicts. Following yacc this information can be specified as a sequence of precedence levels following the token definitions and preceding the grammar rules:

%left '+' '-';
%left '*' '/';
%right '**';

Operator precedence increases in the order of the level statements. Each level statement contains one or more terminal symbols. Terminal symbols in the same level statement have the same precedence and associativity. A precedence level can start with %nonassoc to suppress associativity, e.g., for comparison operators.

Implementation

The level statements have to be added to the grammars for BNF and extended BNF. New subclasses page.BNFP and page.EBNFP of page.BNF and page.EBNF, respectively, are used to represent grammars which include precedence information. The information is only useful for LR(1) parsing.

page.BNFP and page.EBNFP have factory methods page.BNFP#precedence and page.EBNFP#precedence, respectively, which are used in page.BNFP.Builder and page.EBNFP.Builder, respectively, to represent a precedence level as a page.BNFP.Precedence object for display purposes. All terminals in a list for a precedence level receive an additional property .prec with associativity and a numerical precedence level.

page.BNFP overrides the factory method page.BNFP#rule to represent a BNF rule as a page.BNFP.Rule object. A rule can have an explicit precedence (see below) or it has the precedence of the last terminal symbol in the rule, if any. The precedence is assigned by the factory method.

Thus the stage is set to arbitrate some conflicts. The following code is inserted above just before a shift/reduce conflict is reported. Recall that rule is in the complete configuration, f is one of the follow terminals, t is it's key in actions, and an action has not yet been determined. If both, rule and terminal, actually have a precedence the shift/reduce conflict can be arbitrated:

        else { // this.actions[t] == null
          if (rule.prec && 'prec' in f)
            // rule and terminal have defined precedence
            if (rule.prec.level > f.prec.level) {
              // rule's precedence is higher -> reduce
              rule.reduced = true;
              this.actions[t] = _factory.reduce(f, rule);
            } else if (rule.prec.level < f.prec.level) {
              // rule's precedence is lower -> shift (done below)
            }

If the rule has higher precedence it is reduced — in the example this reduces a multiplication before an addition is tackled.

If the rule has lower precedence a shift action will move to another state — in the example this delays reducing an addition until a multiplication as a right argument has been handled.

If rule and terminal are on the same precedence level associativity must decide:

            else // equal precedence
              switch (rule.prec.assoc) {

              case 'left':   // left-associative -> reduce
                rule.reduced = true;
                self.actions[t] = _factory.reduce(f, rule);

              case 'right':  // right-associative -> shift
                break;

              default:       // non-associative -> error action
                rule.reduced = true;     // avoid message
                delete self.actions[t];  // i.e. f as input would be an error
              }
          else
            // shift/reduce error message
        }

If the rule's precedence specifies left-associativity, the rule is reduced — in the example this collects additions, etc., from left to right.

If the rule's precedence specifies right-associativity, a shift action will move to another state — in the example this collects exponentiation from right to left because the rightmost operation is reduced first.

Finally, the rule's precedence requires non-associativity, the terminal is removed from the action table, i.e., the terminal cannot be accepted in this state — this would prevent chains of comparison operators if they were declared non-associative.

Hiding conflicts

It should be noted that precedences are designed to hide conflicts and they are a natural way to control arithmetic. However, more generally, precedences can be used to hide conflicts but it is usually a good idea to check the state table before and after adding precedence information to make sure it has the intended effect. Here is a “conflict-free” dangling else:

%token Number '[0-9]+';

%right 'else';

statement: 'if' Number statement %prec 'else';
statement: 'if' Number statement 'else' statement;
statement: Number;

A BNF rule can include a trailing %prec clause to specify precedence and associativity explicitly by referencing a terminal on a precedence level defined earlier.

For this grammar there is no message about the shift/reduce conflict and state 5 shows that this state table will still collect else with the closest if:

state 5
  statement: 'if' Number statement .;
  statement: 'if' Number statement . 'else' statement;

  $eof         reduce (statement: 'if' Number statement %prec 'else';)
  'else'       shift 6

One might think that changing right to left should collect else with an outer if but in fact the change renders the grammar useless:

state 5
  statement: 'if' Number statement .;
  statement: 'if' Number statement . 'else' statement;

  $eof         reduce (statement: 'if' Number statement %prec 'else';)
  'else'       reduce (statement: 'if' Number statement %prec 'else';)

errors: 1

The error refers to the fact that now the second grammar rule is never reduced, i.e., input cannot contain else. Actually, this should not come as a surprise: intuitively, the state table cannot wait with shifting on else until there is just the last if left.

EBNF and precedence

The state table is computed for a BNF grammar — or for an EBNF grammar internally translated into BNF. The arithmetic example above demonstrates that precedence levels can be specified for EBNF just as well as for BNF; after all, the terminals are shared between the EBNF grammar and it's translation into BNF.

How can a %prec clause be translated from EBNF into BNF? The clause applies to a single BNF rule. EBNF choices and iterations are translated into several BNF rules, i.e., a %prec clause can only be applied to a symbol sequence in EBNF, which can be a choice or even include iterations. Here is a typical, small example which shows how to add a unary minus to the arithmetic example above:

%token Number '[0-9]+';

%left '-';

expr: sub | minus | num;
sub: expr '-' expr;
minus: '-' expr;
num: Number;

With the semantic actions

actions = {
  sub: function (values) { return values[0] - values[1]; },
  minus: function (values) { return - values[0]; },
  num: function (values) { return Number(values[0]); }
};

this will get 1 as the result of evaluating

-1 - -2

i.e., the unary minus has precedence over the binary minus because the grammar enforces that.

Adding another precedence level and a %prec clause

%token Number '[0-9]+';

%right Number;
%left '-';

expr: sub | minus | num;
sub: expr '-' expr;
minus: '-' expr %prec Number;
num: Number;

gives the unary minus less precedence than the binary minus and evaluates the input above to be -3. It is instructive to trace both evaluations. With the %prec clause,

5   (0) $eof         reduce expr: minus;             [ ] -3

is the last reduce message, without it it is

4   (0) $eof         reduce expr: sub;               [ ] 1

Technically, this is an EBNF grammar because of the choices for expr. The %prec clause is not applied to the rule for minus but to the sequence '-' expr which is one of several choices which the rule for minus could have. The BNF translation can be seen by executing

page.EBNFP.from( Grammar.value ).bnf

in the web page.

Why use Number in the level statement for the experiment? All that is required is to connect the unary minus with a %prec clause to a terminal symbol which is on a level below binary minus. Assigning a precedence to Number through a level statement does not affect the state table and does not introduce a new, otherwise unused terminal symbol.

Trick question: without a %prec clause and without changes to the grammar structure, how can

-1 -- 2 ---- 4

be evaluated as either -7 or 5?

Next: Bottom-up parsing Previous: Bootstrap