page Tutorial: LL(1)

The idea

Parsing means to analyze if a sequence of terminal symbols is a sentence of a language. If it is there has to be a syntax tree. Top-down parsing tries to build the syntax tree from the root to the leaf nodes, i.e., beginning with the start symbol of a BNF grammar for the language as the branch node at the root, through non-terminals as further branch nodes, and finally arriving at the terminals as leaf nodes. The strategy is called recursive descent because the syntax tree is built during recursive visits to non-terminals, descending from the root to the leaves.

The grammar is viewed as a set of functions, one per non-terminal. A function body combines all the rules for a non-terminal, i.e., it contains a set of right-hand sides. Each right-hand side is viewed as a sequence of instructions, one per symbol. A terminal symbol calls for a match with input and results in a leaf node. A non-terminal symbol results in a call to the non-terminals function which builds a branch node. For example:

 0  $accept: addition $eof;
 1  addition: Number more;
 2  more: '+' Number more;
 3  more: ;

Rule 0 requires that addition exactly match the entire input. This is the only rule for the internal start symbol $accept, and the only right-hand side contains addition and a literal $eof for the end of input.

addition is a non-terminal which is interpreted as a function call to build a branch node of the syntax tree. We reach rule 1 with the only right-hand side for addition and it starts with the token Number. Therefore, we can only continue if the first terminal symbol in the input is a Number. If so, we have a leaf node as first descendant of the branch node for addition. We now need to match the next symbol in this right-hand side, i.e., more.

Because more is a non-terminal, it acts as a function call and we need to match one of the right-hand sides for more. The first one, in rule 2, starts with the literal +. If the next input symbol is +, we must continue with the selected right-hand side which now needs to find another Number followed by a (recursive) match for another more. Later the two leaf nodes and the branch node for more will be used as descendants for a branch node labeled more.

top-down parsing

Eventually we will run out of +, i.e., we cannot match rule 2 for more any longer. Therefore, we look for other right-hand sides for more, if any. For more there is rule 3 and its right-hand side is empty, i.e., it will match and not require any input. This results in a branch node for more without a descendant.

The recursive calls end and we are back to rule 1 which can build a branch node for addition with a Number leaf node and a more branch node as descendants.

Back to rule 0, or rather the function for $accept which now needs to match $eof, i.e., the end of input.

In other words, if the input exactly contains one or more Number separated by + this process proves existence of a syntax tree for the input, i.e., these kinds of input are sentences.

Definitions

Unfortunately, the strategy above does not work for all grammars:

addition: addition '+' Number;
addition: Number;

If a rule is left-recursive over one or more levels we end up in an infinite recursion.

addition: Number;
addition: Number '+' addition;

Right-hand sides must not start with the same terminal because we would not know which right-hand side to use — the process cannot backtrack to unmatch already matched input.

addition: more Number;
more: Number '+' more;
more: ;

If a rule has an empty right-hand side all other of its right-hand sides cannot start with a symbol which can also follow input matched by the rule because here the function for more would not know whether to match a Number or leave it for its caller addition.

LL(1) formalizes these observations. It is a sufficient condition for a grammar to be non-ambiguous and certifies the grammar as suitable for input checking by recursive descent as described above. The name indicates the tree construction strategy: left-to-right with one symbol lookahead.

LL(1) can be checked by inspection of a BNF grammar. Two sets, first and follow, are defined for symbols and rules of the grammar. LL(1) then consists of two conditions on these sets as described in the next section below.

`empty`

In order to compute the sets we also define a condition empty for non-terminals and rules.

A rule is empty if its right-hand side contains no symbols. The non-terminal on the left-hand side of this rule is empty, too.

So far, empty can be computed simply by inspecting the grammar.

`first`

By definition, a first set contains those terminal symbols of which one must be in the input when we start to build a branch node of a syntax tree, i.e., when we try to find a rule and match a non-terminal or terminal of the grammar.

The first set of a terminal consists of the terminal itself — if we try to match a terminal in the grammar we need it in the input.

The first set of a non-terminal starts out empty.

The first set of a rule starts with the first set of the first symbol in the right-hand side. As long as this symbol is an empty non-terminal we proceed to the next symbol on the right-hand side and add its first set to the first set of the rule and the non-terminal on the left-hand side.

If all symbols on the right-hand side are empty non-terminals, we call the rule and the non-terminal on the left-hand side, both, empty.

This definition of first and empty requires iterating over the rules until there are no more changes. The iteration will terminate because there is a finite number of terminals, non-terminals, and rules, and all sets can at most contain a finite number of terminals. Here is the implementation, part of page.BNF#init:

do {
  changed = false;
  this.rules.forEach( function (rule) {
    if (rule.symbols.length > 0) {
      if (rule.symbols.every( function (symbol) {
            changed |= page.merge(rule.first, symbol.first);
            return symbol instanceof page.BNF.NT ? symbol.empty : false;
          }))
        changed = rule.empty = true;
      changed |= page.merge(rule.nt.first, rule.first);
      if (!rule.nt.empty && rule.empty) changed = rule.nt.empty = true;
    }
  });
} while (changed);

`follow`

By definition, a follow set contains those terminals of which one must be in the input after we have completed a branch node of the syntax tree, i.e., after we matched a non-terminal.

The follow set of a non-terminal starts out empty.

For each non-empty rule, we start with the current follow set of the non-terminal on its left-hand side and the last symbol on its right-hand side and proceed backwards to the first symbol.

If the symbol is a non-terminal, we add the current set to its follow set.

We then take the first set of the symbol, terminal or not. For empty non-terminals we use the sum of this first and the new follow set. With the result we proceed to the previous symbol in the right-hand side until we reach the beginning of the rule.

This process to determine follow requires iterating over the rules until there are no more changes. Observe that we start out in rule 0 and add $eof to the initially empty follow set for the start symbol. The iteration will terminate for the same reason as the iteration for first. Here is the implementation, part of page.BNF#init:

do {
  changed = false;
  self.rules.forEach( function (rule) {
      if (rule.symbols.length > 0)
        rule.symbols.reduceRight( function (follow, last) {
            if (last instanceof page.BNF.NT) {
              changed |= page.merge(last.follow, follow);
              if (last.empty)
                return Object.assign({}, last.first, follow);
            }
            return last.first;
          }, rule.nt.follow);
    });
} while (changed);

Testing for LL(1)

By definition, a grammar is LL(1) if it satisfies the following two conditions:

The first sets of all those rules are pairwise disjoint which have the same non-terminal on the left-hand side.
If a non-terminal is empty its first set must be disjoint from its follow set.

page.BNF#init computes empty, first, and follow when the representation of a BNF grammar is completed.

Similarly, page.EBNF#init calls page.BNF#init once the EBNF grammar is translated to BNF and then copies empty, first, and follow back to the rules and nodes of the representation of the EBNF grammar.

Finally, a page.LL1 object wraps a grammar for top-down parsing and its constructor implements the LL(1) test.

For example, the BNF grammar for BNF is LL(1):

jjs> new page.LL1(page.BNF.ctor()).toString()

There are no error messages, i.e., the grammar is LL(1) and, therefore, not ambiguous.

The first grammar for addition is left recursive and as such not suitable for recursive descent.

%token Number '[0-9]+';

addition: Number;
addition: addition '+' Number;

A check for LL(1) in the web page produces an error message and the error is noted at page.LL1.ambiguous:

addition has ambiguous alternatives

ambiguous alternatives: 1
[done]

With the option describe the web page additionally uses the dump method to show the various sets:

 0: $accept: addition $eof;
    first: Number
 1: addition: Number;
    first: Number
 2: addition: addition '+' Number;
    first: Number

 4: $accept
    first: Number

 5: addition
    first: Number
    follow: $eof '+'
[done]

page.LL1 could test an EBNF grammar by deferring to the BNF version, but error messages can refer directly to the EBNF constructs if the test is performed by visiting the EBNF rules:

%token Number '[0-9]+';    
addition: ( addition '+' )? Number;

check produces

$2 has overlapping first and follow sets (rep)

[done]

i.e., the optional clause — an iteration represented by page.EBNF.Rep — has a problem. This example shows that left recursion can present as a greedy problem.

Recursive descent

If a grammar is LL(1) recursive descent will unambiguously build syntax trees based on the next input symbol.

We start with the non-terminal $accept and the first input symbol. Note that the terminal $eof will follow all input symbols and that it only appears at the end of the very first rule.

If the current input symbol is in the first set of the current non-terminal it must be in the first set of exactly one of its rules. If not, the input is not a sentence. Consider each symbol in the right-hand side of that rule.
If this symbol is a terminal symbol it must be the current input symbol or the input is not a sentence; this completes a leaf node of the syntax tree. If it is $eof we have found a sentence and we are done; this completes the branch node for $accept. Otherwise move on to the next input symbol and the next symbol in the rule.
If the current symbol is a non-terminal symbol the current input symbol must be in its first set. If it is, recursively consider the new non-terminal, beginning at step 1. If it is not and the non-terminal is not empty the input is not a sentence.
If the non-terminal is empty, or if the recursion to the non-terminal returns, move on to the next symbol in the rule. Complete a branch node when the last symbol on the right-hand side of the current rule has been processed; then return from recursion.

Note that LL(1) guarantees that there is no ambiguity: The first sets are disjoint; therefore, only one rule can be picked in step 1. For an empty non-terminal, the first and follow sets are disjoint; therefore, we cannot voluntarily skip step 3 in favor of step 4.

Dealing with ambiguity

If the relevant first sets are not pairwise disjoint we have no reasonable fix for step 1 above. Therefore, we count this grammar issue as ambiguous. Note that, in particular, a left recursive grammar can have this problem:

addition: Number;
addition: bang more;
bang:     '!';
more:     addition '+' Number;

This grammar describes lists of numbers separated by + which must be prefixed by as many exclamation points as there are + operators, e.g., ! ! 1 + 2 + 3. The grammar is left-recursive (and ambiguos) as soon as bang can be empty.

If the relevant first and follow sets are not disjoint, and if the current input symbol appears in both, we have a choice between step 3 and step 4 above. We can build a taller tree if we choose step 3 because we build another branch node for a new non-terminal rather than moving on in the current rule (and branch node). Therefore, we count this grammar issue as greedy. Here is a classic example, the dangling else problem:

%token Number '[0-9]+';  

statement: Number;
statement: 'IF' Number 'ELSE' statement else;
else:      'ELSE' statement;
else:      ;

check shows

else has overlapping first and follow sets
[done]

The non-terminal else starts with ELSE but can also be followed by ELSE. Programming languages often allow this and state that ELSE is silently considered part of the innermost IF statement.

In our algorithm above this means preferring step 3 over step 4, i.e., acting greedy by accepting as much input as possible before moving on in an outer rule.

Generally speaking, recursive descent should only be based on LL(1) grammars. ambiguous issues should be rejected because they could indicate left recursion. greedy issues can often be accepted if they are more convenient to describe a language.

The idea

Definitions

empty

first

follow