The idea
The following grammar describes sequences starting with a single a
, optionally followed by a single b
,
and ending in any number of c
:
root: 'a';
root: 'a' 'b';
root: root 'c';
The grammar is left-recursive and the first two rules for root
start with the same terminal symbol,
i.e., the grammar is definitely not LL(1).
However, it is not ambiguous because a sentence starts either with a
or with a
and b
and that uniquely determines
how each syntax tree is built.
If recognition is based on the grammar how should it proceed? For LR(1), rather than seleting one rule as in LL(1), a position marker is placed at the beginning of the rules and advanced as input arrives:
The result is a directed graph where the nodes are recognition states which contain marked rules and the edges indicate transitions based on input terminals or recognized non-terminals.
A non-terminal is recognized exactly when a rule can be satisfied, i.e., the graph needs edges for completed rules. A completed rule has the marker at the end of the rule. Edges must lead from states with completed rules to states with rules where the marker is just before a non-terminal corresponding to the completed rule:
In the diagram above these edges are labeled reduce and are marked with double arrowheads. A transition based on a terminal is called a shift and a transition based on a non-terminal is called a goto. A reduce transition is always immediately followed by a goto transition.
The diagram above is not quite complete because rule 0
$accept: root $eof
still needs to be pictured:
This provides a unique start state 0 for recognition, namely the state which contains rule 0 marked at the beginning.
It also provides a unique state where recognition ends successfully, namely the state which contains rule 0 marked at the end, following $eof
.
One more thought and the recognition technique is complete: The diagram based on states composed of marked rules is finite because the number and length of rules, both, are finite. However, a grammar can describe a set of nested parentheses of arbitrary depth, i.e., recognition needs a mechanism for unlimited counting which a finite diagram cannot provide.
Therefore, the states are kept on a state stack with the current state on top. Each shift and goto transition, i.e., recognizing an input terminal or a completed non-terminal, pushes one new state on top of the stack. A reduce transition pops states off the stack, namely exactly as many states as it took to complete the rule which is reduced, i.e., exactly as many states as there are symbols in the right-hand side of the rule. This can be counted out in the diagram.
The architecture
page uses similar class architectures for LL(1) and LR(1) parsing. A page.SLR1 object wraps a grammar for parsing. If the grammar is represented as a page.EBNF object the corresponding page.BNF object will be used because only BNF rules can be marked with a single marker per rule.
The state table is computed during construction and used during parsing. The state stack is held by the object, i.e., there can only be one parsing operation at a time.
The parsing method page.SLR1#parse uses the same parameters as page.LL1#parse, i.e., input can be a sequence of page.Tuple objects or a function providing such sequences. Because the parsing state is stacked, successive input sequences can be pushed to the parsing method.
Recognition can be observed with page.trace, although LR(1) parsing sends slightly different messages. Semantic actions again use the Builder interface and are selected by an observer created with page.SLR1#reflect.
The classic LR(1) implementations
yacc
and
bison
spend considerable effort on compacting the state table.
page is more tutorial in nature;
therefore, the state table is an array of page.SLR1.State objects.
Each holds an array of one or more page.SLR1.Config objects, i.e., the marked rules,
and an actions
collection which maps the symbols for the outgoing edges of the state to the possible actions
shift
, goto
, reduce
, and accept
.
For display purposes these actions are encoded as page.SLR1.Action objects.
Constructing the state table
When a page.SLR1 object is constructed, the first page.SLR1.State object is created with rule 0 marked at the beginning:
$accept: ● start-symbol $eof;
The page.SLR1 object has factory methods page.SLR1#config and page.SLR1#state for page.SLR1.Config and page.SLR1.State objects, and it contains the state table:
this.states.push(this.state([ this.config(this.bnf.rules[0], 0) ]));
A state is created with one or more marked rules. Together they are called the core configurations and they uniquely define the state. The state additionally contains the closure of the core: all rules for all non-terminals which immediately follow a marker. Here is the factory method:
page.SLR1.prototype.state = function (_core) {
var coreLength = _core.length;
var actions = {};
// compute closure: loop over core and added configurations
...
return new page.SLR1.State(_core, coreLength, actions);
};
The state stores the number of configurations in the core because only cores have to be compared
to determine if two states are equal.
The actions
collection will contain all symbols which immediately follow a marker.
This is set up when the rules in the closure are added to _core
:
// compute closure: loop over core and added configurations
for (var c = 0; c < _core.length; ++ c)
// for each configuration where the marker is not at the end
if (!_core[c].complete) {
// next symbol in the configuration
var s = _core[c].rule.symbols[_core[c].position];
// if it is a non-terminal and not seen before
if (s instanceof page.BNF.NT && !(s.ord in actions))
// add all rules for a new non-terminal, marked at 0
s.rules.forEach(function (rule) {
_core.push(this.config(rule, 0));
}, this);
// add this next terminal or non-terminal
actions[s.ord] = null; // no action yet
}
This loop runs as long as new marked rules are added —
until it reaches the eventual end of _core
.
For a configuration .complete
is true if the marker follows all symbols.
In a similar loop, all state objects are then asked to page.SLR1.State#advance, i.e.,
to create outgoing edges and more states as needed to terminate the edges, and to fill in the actions
table:
for (var s = 0; s < this.states.length; ++ s)
this.states[s].advance(this, s);
The heavy lifting happens in the advance
method:
page.SLR1.State.prototype.advance = function (_factory, _stateNumber) {
// create reduce actions for complete configurations
this.configs.forEach(function (config) {
if (config.complete) {
...
}
}, this);
The first step is to enter reduce
actions into the actions
table
for all complete
configurations. The details are discussed in the following sections.
Each literal, token, and non-terminal has a unique ordinal number, starting with 0.
actions
maps this number to an action. Once the reduce
actions have been entered,
all other entries in actions
will have to be shift
or goto
, with one exception:
// create accept/goto/shift actions for each next symbol which has none
for (var a in this.actions)
if (this.actions[a] == null)
if (a == _factory.bnf.lit().ord) {
// special case: $eof
this.actions[a] = _factory.accept();
_factory.bnf.rules[0].reduced = true;
$eof
only appears in rule 0 and this symbol can only lead to a reduce
or accept
action.
accept
means that rule 0 is successfully recognized.
If the symbol is not $eof
, the action will be shift
or goto
and needs a target state
which is computed by advancing the marker across the symbol in all configurations
where the marker is currently just before the symbol:
} else {
// create next core by advancing marker over one symbol
var next = [ ];
var symbol = null;
this.configs.forEach(function (config) {
// find a as next symbol in all configs
if (!config.complete && a == config.rule.symbols[config.position].ord) {
// remember symbol and push config with marker after symbol
symbol = config.rule.symbols[config.position];
next.push(config.advance(_factory));
}
});
next
is used to collect all new configurations with the marker moved across the symbol
.
page.SLR1.Config#advance uses page.SLR1#config to create a new configuration
with the same rule but the marker in the next position. next
contains at least one configuration,
otherwise a
would not have been a key in actions
for this state.
next
is the core of the target state for the shift
or goto
action to be recorded for
the key a
in actions
. If the core cannot be found among the known states, a new state needs to be created:
// add new state with next as core, if any
// shift/goto existent or new state
if (!_factory.states.some(
function (state, s) {
// compare new core and existing state
if (!state.equals(next)) return false;
// found: create action
this.actions[a] = _factory.shift_or_goto(symbol, s);
return true;
}, this)) {
// not found: create action and new state
this.actions[a] = _factory.shift_or_goto(symbol, _factory.states.length);
_factory.states.push(_factory.state(next));
}
}
};
This concludes creating all but the reduce
actions in page.SLR1.State#advance,
and it looks as if nothing can go wrong.
Unfortunately, the reduce
actions show that there can be problems — not every grammar is suitable for LR(1) parsing.
Conflicts
The previous section did not discuss how reduce
actions are entered
into a state's action table before the other actions are filled in.
A reduce
action requires a complete
configuration
and it only makes sense to enter them for terminals which can follow the configuration's rule's non-terminal:
// create reduce actions for complete configurations
this.configs.forEach(function (config) {
if (config.complete) {
var rule = config.rule; // rule we are in
// for each terminal which can follow the rule in the grammar
for (var t in rule.nt.follow) { // ordinal number
var f = rule.nt.follow[t]; // terminal which can follow
if (!(t in this.actions)) { // can it follow in this state?
// if t is not in actions it cannot follow this state -> reduce
rule.reduced = true;
this.actions[t] = _factory.reduce(f, rule);
}
If such a terminal is not yet a key in the state's action table,
it cannot lead from the state to another one;
therefore, this terminal needs to be added with a reduce
action.
However, if the terminal is already a key in the action table, there is a conflict.
If the terminal was associated with a reduce
action previously,
the terminal is in the follow
set for some other complete configuration,
i.e., there are two different rules which can be reduced before the terminal is accepted:
else if (this.actions[t] != null) {
// t is in actions and actions[t] is already set as a reduce
var r2 = this.actions[t].info; // the other rule
++ _factory.rr;
this.message('state', _stateNumber, 'for', f,
'reduce/reduce conflict between', '(' + rule + ')', 'and', '(' + r2 + ')');
// resolve for rule which is first in the grammar
if (rule.index < r2.index) this.actions[t].info = rule;
}
Technically, this reduce/reduce conflict makes the grammar unsuitable for LR(1) parsing because the state table is not unique. Traditionally, as a stopgap measure in such a case, the first rule in the grammar is selected to be reduced.
If there is no prior reduce
action, the terminal is in the action table
because it will be associated with a shift
action when the marker is moved across it,
as discussed in the previous section:
else { // this.actions[t] == null
++ _factory.sr;
this.message('state', _stateNumber, 'has a shift/reduce conflict:',
f, 'and rule', '(' + rule + ')');
} // done with conflicts
} // done with every t which can follow
} // done with every complete configuration
}, this);
This is called a shift/reduce conflict and makes the grammar unsuitable for LR(1) parsing, too. In this case the stopgap measure is to prefer the shift action.
A shift/reduce conflict
Prefering the shift
action is often desirable, for example to resolve the classic dangling else
problem
in favor of the innermost if
statement:
%token Number '[0-9]+';
statement: 'if' Number statement;
statement: 'if' Number statement 'else' statement;
statement: Number;
The grammar above has a shift/reduce conflict for else
but a trace of parsing
if 1
if 2 3 else 4
shows that the inner if
statement is reduced with else
:
2 (3) 'else' reduce statement: Number; [ ] 3
2 (0) $eof reduce statement: Number; [ ] 4
7 (0) $eof reduce statement: 'if' Number [ ] 2,3,4
5 (0) $eof reduce statement: 'if' Number [ ] 1,2,3,4
A reduce/reduce conflict
The built-in resolution for a shift/reduce conflict can often be accepted
but reduce/reduce conflicts are usually more serious and need to be investigated carefully.
Here is a grammar, hugely abstracted,
where an expression
is based on a condition
or arithmetic
and where a condition
is based on a comparison or arithmetic:
%token Number '[0-9]+';
expression: condition;
expression: Number;
condition: sum '<' sum;
sum: Number;
This grammar has a reduce/reduce conflict between the second and the last rule
which is resolved in favor of the second rule.
Tracing shows that a single Number
such as 0
can be recognized.
However, if the last rule is moved into second place the conflict is resolved in favor of that rule
and now a 0
cannot be recognized!
“Real” LR(1)
As a final note, the implementation discussed here is a slight simplification of LR(1) and is, therefore, called SLR(1).
Using the follow
set of the rule's non-terminal when a complete configuration is processed
is convenient but not as restrictive as possible.
Each configuration starts with a nonterminal in a known context
and one could consider only those terminals which can follow in that context,
i.e., each configuration should include it's own follow
set.
This results in larger state tables but potentially fewer conflicts
and, therefore, more grammars which are suitable for LR(1) parsing.
Precedence
The LL(1) grammar for interpreting arithmetic expressions was a bit cumbersome and operator associativity had to be implemented in the semantic actions, not in the grammar itself. The following grammar is more natural:
%token Number '[0-9]+';
sum: add | sub | prod;
add: sum '+' prod;
sub: sum '-' prod;
prod: mul | div | factor;
mul: prod '*' factor;
div: prod '/' factor;
factor: pow | prim;
pow: prim '**' factor;
prim: num | '(' sum ')';
num: Number;
page translates this grammar to BNF so that rules can be properly marked.
Irrespective of the translation,
the rules for most operators are left-recursive, corresponding to the left-associativity of the operators;
only exponentiation is right-associative and expressed with a right-recursive rule.
As in the LL(1) example, operator precedence is encoded into the grammar
because the non-terminals sum
, prod
, factor
, and prim
establish appropriate levels.
add
, sub
, etc. are hooks for semantic actions and a typical action such as
add: function (values) { return values[0] + values[1]; }
does not have to consider associativity.
On the other hand, the grammar
%token Number '[0-9]+';
expr: add | sub | mul | div | pow | '(' expr ')' | num;
add: expr '+' expr;
sub: expr '-' expr;
mul: expr '*' expr;
div: expr '/' expr;
pow: expr '**' expr;
num: Number;
is even shorter but it lacks any information about operator precedence and associativity and page notes 25 shift/reduce conflicts. Following yacc this information can be specified as a sequence of precedence levels following the token definitions and preceding the grammar rules:
%left '+' '-';
%left '*' '/';
%right '**';
Operator precedence increases in the order of the level statements.
Each level statement contains one or more terminal symbols.
Terminal symbols in the same level statement have the same precedence and associativity.
A precedence level can start with %nonassoc
to suppress associativity, e.g., for comparison operators.
Implementation
The level statements have to be added to the grammars for BNF and extended BNF. New subclasses page.BNFP and page.EBNFP of page.BNF and page.EBNF, respectively, are used to represent grammars which include precedence information. The information is only useful for LR(1) parsing.
page.BNFP and page.EBNFP have factory methods page.BNFP#precedence and page.EBNFP#precedence, respectively,
which are used in page.BNFP.Builder and page.EBNFP.Builder, respectively,
to represent a precedence level as a page.BNFP.Precedence object for display purposes.
All terminals in a list for a precedence level receive an additional property .prec
with
associativity and a numerical precedence level.
page.BNFP overrides the factory method page.BNFP#rule to represent a BNF rule as a page.BNFP.Rule object. A rule can have an explicit precedence (see below) or it has the precedence of the last terminal symbol in the rule, if any. The precedence is assigned by the factory method.
Thus the stage is set to arbitrate some conflicts.
The following code is inserted above just before a shift/reduce conflict is reported.
Recall that rule
is in the complete configuration,
f
is one of the follow
terminals, t
is it's key in actions
, and an action has not yet been determined.
If both, rule and terminal, actually have a precedence the shift/reduce conflict can be arbitrated:
else { // this.actions[t] == null
if (rule.prec && 'prec' in f)
// rule and terminal have defined precedence
if (rule.prec.level > f.prec.level) {
// rule's precedence is higher -> reduce
rule.reduced = true;
this.actions[t] = _factory.reduce(f, rule);
} else if (rule.prec.level < f.prec.level) {
// rule's precedence is lower -> shift (done below)
}
If the rule has higher precedence it is reduced — in the example this reduces a multiplication before an addition is tackled.
If the rule has lower precedence a shift
action will move to another state — in the example this delays reducing an addition
until a multiplication as a right argument has been handled.
If rule and terminal are on the same precedence level associativity must decide:
else // equal precedence
switch (rule.prec.assoc) {
case 'left': // left-associative -> reduce
rule.reduced = true;
self.actions[t] = _factory.reduce(f, rule);
case 'right': // right-associative -> shift
break;
default: // non-associative -> error action
rule.reduced = true; // avoid message
delete self.actions[t]; // i.e. f as input would be an error
}
else
// shift/reduce error message
}
If the rule's precedence specifies left-associativity, the rule is reduced — in the example this collects additions, etc., from left to right.
If the rule's precedence specifies right-associativity, a shift
action will move to another state — in the example this collects exponentiation from right to left because the rightmost operation is reduced first.
Finally, the rule's precedence requires non-associativity, the terminal is removed from the action table, i.e., the terminal cannot be accepted in this state — this would prevent chains of comparison operators if they were declared non-associative.
Hiding conflicts
It should be noted that precedences are designed to hide conflicts and they are a natural way to control arithmetic. However, more generally, precedences can be used to hide conflicts but it is usually a good idea to check the state table before and after adding precedence information to make sure it has the intended effect. Here is a “conflict-free” dangling else:
%token Number '[0-9]+';
%right 'else';
statement: 'if' Number statement %prec 'else';
statement: 'if' Number statement 'else' statement;
statement: Number;
A BNF rule can include a trailing %prec
clause to specify precedence and associativity explicitly
by referencing a terminal on a precedence level defined earlier.
For this grammar there is no message about the shift/reduce conflict and
state 5 shows that this state table will still collect else
with the closest if
:
state 5
statement: 'if' Number statement .;
statement: 'if' Number statement . 'else' statement;
$eof reduce (statement: 'if' Number statement %prec 'else';)
'else' shift 6
One might think that changing right
to left
should collect else
with an outer if
but in fact the change renders the grammar useless:
state 5
statement: 'if' Number statement .;
statement: 'if' Number statement . 'else' statement;
$eof reduce (statement: 'if' Number statement %prec 'else';)
'else' reduce (statement: 'if' Number statement %prec 'else';)
errors: 1
The error refers to the fact that now the second grammar rule is never reduced,
i.e., input cannot contain else
.
Actually, this should not come as a surprise: intuitively, the state table cannot wait
with shifting on else
until there is just the last if
left.
EBNF and precedence
The state table is computed for a BNF grammar — or for an EBNF grammar internally translated into BNF. The arithmetic example above demonstrates that precedence levels can be specified for EBNF just as well as for BNF; after all, the terminals are shared between the EBNF grammar and it's translation into BNF.
How can a %prec
clause be translated from EBNF into BNF?
The clause applies to a single BNF rule.
EBNF choices and iterations are translated into several BNF rules,
i.e., a %prec
clause can only be applied to a symbol sequence in EBNF,
which can be a choice or even include iterations.
Here is a typical, small example which shows how to add a unary minus to the arithmetic example above:
%token Number '[0-9]+';
%left '-';
expr: sub | minus | num;
sub: expr '-' expr;
minus: '-' expr;
num: Number;
With the semantic actions
actions = {
sub: function (values) { return values[0] - values[1]; },
minus: function (values) { return - values[0]; },
num: function (values) { return Number(values[0]); }
};
this will get 1
as the result of evaluating
-1 - -2
i.e., the unary minus has precedence over the binary minus because the grammar enforces that.
Adding another precedence level and a %prec
clause
%token Number '[0-9]+';
%right Number;
%left '-';
expr: sub | minus | num;
sub: expr '-' expr;
minus: '-' expr %prec Number;
num: Number;
gives the unary minus less precedence than the binary minus and evaluates the input above to be -3
.
It is instructive to trace both evaluations. With the %prec
clause,
5 (0) $eof reduce expr: minus; [ ] -3
is the last reduce
message, without it it is
4 (0) $eof reduce expr: sub; [ ] 1
Technically, this is an EBNF grammar because of the choices for expr
.
The %prec
clause is not applied to the rule for minus
but to the sequence '-'
expr
which is one of several choices which the rule for minus
could have.
The BNF translation can be seen by executing
page.EBNFP.from( Grammar.value ).bnf
in the web page.
Why use Number
in the level statement for the experiment?
All that is required is to connect the unary minus with a %prec
clause
to a terminal symbol which is on a level below binary minus.
Assigning a precedence to Number
through a level statement
does not affect the state table and does not introduce a new, otherwise unused terminal symbol.
Trick question: without a %prec
clause and without changes to the grammar structure, how can
-1 -- 2 ---- 4
be evaluated as either -7
or 5
?