The idea
Parsing means to analyze if a sequence of terminal symbols is a sentence of a language. If it is there has to be a syntax tree. Top-down parsing tries to build the syntax tree from the root to the leaf nodes, i.e., beginning with the start symbol of a BNF grammar for the language as the branch node at the root, through non-terminals as further branch nodes, and finally arriving at the terminals as leaf nodes. The strategy is called recursive descent because the syntax tree is built during recursive visits to non-terminals, descending from the root to the leaves.
The grammar is viewed as a set of functions, one per non-terminal. A function body combines all the rules for a non-terminal, i.e., it contains a set of right-hand sides. Each right-hand side is viewed as a sequence of instructions, one per symbol. A terminal symbol calls for a match with input and results in a leaf node. A non-terminal symbol results in a call to the non-terminals function which builds a branch node. For example:
0 $accept: addition $eof;
1 addition: Number more;
2 more: '+' Number more;
3 more: ;
Rule 0 requires that addition
exactly match the entire input.
This is the only rule for the internal start symbol $accept
,
and the only right-hand side contains addition
and a literal $eof
for the end of input.
addition
is a non-terminal which is interpreted as a function call to build a branch node of the syntax tree.
We reach rule 1 with the only right-hand side for addition
and it starts with the token Number
.
Therefore, we can only continue if the first terminal symbol in the input is a Number
.
If so, we have a leaf node as first descendant of the branch node for addition
.
We now need to match the next symbol in this right-hand side, i.e., more
.
Because more
is a non-terminal, it acts as a function call
and we need to match one of the right-hand sides for more
.
The first one, in rule 2, starts with the literal +
.
If the next input symbol is +
, we must continue with the selected right-hand side
which now needs to find another Number
followed by a (recursive) match for another more
.
Later the two leaf nodes and the branch node for more
will be used as descendants for a branch node labeled more
.
Eventually we will run out of +
, i.e., we cannot match rule 2 for more
any longer.
Therefore, we look for other right-hand sides for more
, if any.
For more
there is rule 3 and its right-hand side is empty, i.e., it will match and not require any input.
This results in a branch node for more
without a descendant.
The recursive calls end and we are back to rule 1 which can build a branch node for addition
with a Number
leaf node and a more
branch node as descendants.
Back to rule 0, or rather the function for $accept
which now needs to match $eof
,
i.e., the end of input.
In other words, if the input exactly contains one or more Number
separated by +
this process proves existence of a syntax tree for the input, i.e., these kinds of input are sentences.
Definitions
Unfortunately, the strategy above does not work for all grammars:
addition: addition '+' Number;
addition: Number;
If a rule is left-recursive over one or more levels we end up in an infinite recursion.
addition: Number;
addition: Number '+' addition;
Right-hand sides must not start with the same terminal because we would not know which right-hand side to use — the process cannot backtrack to unmatch already matched input.
addition: more Number;
more: Number '+' more;
more: ;
If a rule has an empty right-hand side
all other of its right-hand sides cannot start with a symbol which can also follow input matched by the rule
because here the function for more
would not know whether to match a Number
or leave it for its caller addition
.
LL(1) formalizes these observations. It is a sufficient condition for a grammar to be non-ambiguous and certifies the grammar as suitable for input checking by recursive descent as described above. The name indicates the tree construction strategy: left-to-right with one symbol lookahead.
LL(1) can be checked by inspection of a BNF grammar.
Two sets, first
and follow
, are defined for symbols and rules of the grammar.
LL(1) then consists of two conditions on these sets as described in the next section below.
empty
In order to compute the sets we also define a condition empty
for non-terminals and rules.
A rule is empty
if its right-hand side contains no symbols.
The non-terminal on the left-hand side of this rule is empty
, too.
So far, empty
can be computed simply by inspecting the grammar.
first
By definition, a first
set contains those terminal symbols of which one must be in the input
when we start to build a branch node of a syntax tree, i.e., when we try to find a rule and match a non-terminal or terminal of the grammar.
The first
set of a terminal consists of the terminal itself — if we try to match a terminal in the grammar we need it in the input.
The first
set of a non-terminal starts out empty.
The first
set of a rule starts with the first
set of the first symbol in the right-hand side.
As long as this symbol is an empty
non-terminal
we proceed to the next symbol on the right-hand side and add its first
set to the first
set of the rule
and the non-terminal on the left-hand side.
If all symbols on the right-hand side are empty
non-terminals,
we call the rule and the non-terminal on the left-hand side, both, empty
.
This definition of first
and empty
requires iterating over the rules until there are no more changes.
The iteration will terminate because there is a finite number of terminals, non-terminals, and rules,
and all sets can at most contain a finite number of terminals.
Here is the implementation, part of page.BNF#init:
do {
changed = false;
this.rules.forEach( function (rule) {
if (rule.symbols.length > 0) {
if (rule.symbols.every( function (symbol) {
changed |= page.merge(rule.first, symbol.first);
return symbol instanceof page.BNF.NT ? symbol.empty : false;
}))
changed = rule.empty = true;
changed |= page.merge(rule.nt.first, rule.first);
if (!rule.nt.empty && rule.empty) changed = rule.nt.empty = true;
}
});
} while (changed);
follow
By definition, a follow
set contains those terminals of which one must be in the input
after we have completed a branch node of the syntax tree,
i.e., after we matched a non-terminal.
The follow
set of a non-terminal starts out empty.
For each non-empty
rule, we start with the current follow
set of the non-terminal on its left-hand side
and the last symbol on its right-hand side and proceed backwards to the first symbol.
If the symbol is a non-terminal, we add the current set to its follow
set.
We then take the first
set of the symbol, terminal or not.
For empty
non-terminals we use the sum of this first
and the new follow
set.
With the result we proceed to the previous symbol in the right-hand side until we reach the beginning of the rule.
This process to determine follow
requires iterating over the rules until there are no more changes.
Observe that we start out in rule 0 and add $eof
to the initially empty follow
set for the start symbol.
The iteration will terminate for the same reason as the iteration for first
.
Here is the implementation, part of page.BNF#init:
do {
changed = false;
self.rules.forEach( function (rule) {
if (rule.symbols.length > 0)
rule.symbols.reduceRight( function (follow, last) {
if (last instanceof page.BNF.NT) {
changed |= page.merge(last.follow, follow);
if (last.empty)
return Object.assign({}, last.first, follow);
}
return last.first;
}, rule.nt.follow);
});
} while (changed);
Testing for LL(1)
By definition, a grammar is LL(1) if it satisfies the following two conditions:
The
first
sets of all those rules are pairwise disjoint which have the same non-terminal on the left-hand side.If a non-terminal is
empty
itsfirst
set must be disjoint from itsfollow
set.
page.BNF#init computes empty
, first
, and follow
when the representation
of a BNF grammar is completed.
Similarly, page.EBNF#init calls page.BNF#init once the EBNF grammar is translated to BNF
and then copies empty
, first
, and follow
back to the rules and nodes of the representation of the EBNF grammar.
Finally, a page.LL1 object wraps a grammar for top-down parsing and its constructor implements the LL(1) test.
For example, the BNF grammar for BNF is LL(1):
jjs> new page.LL1(page.BNF.ctor()).toString()
There are no error messages, i.e., the grammar is LL(1) and, therefore, not ambiguous.
The first grammar for addition is left recursive and as such not suitable for recursive descent.
%token Number '[0-9]+';
addition: Number;
addition: addition '+' Number;
A check for LL(1) in the web page produces an error message and the error is noted at page.LL1.ambiguous
:
addition has ambiguous alternatives
ambiguous alternatives: 1
[done]
With the option describe the web page
additionally uses the dump
method to show the various sets:
0: $accept: addition $eof;
first: Number
1: addition: Number;
first: Number
2: addition: addition '+' Number;
first: Number
4: $accept
first: Number
5: addition
first: Number
follow: $eof '+'
[done]
page.LL1 could test an EBNF grammar by deferring to the BNF version, but error messages can refer directly to the EBNF constructs if the test is performed by visiting the EBNF rules:
%token Number '[0-9]+';
addition: ( addition '+' )? Number;
check produces
$2 has overlapping first and follow sets (rep)
[done]
i.e., the optional clause — an iteration represented by page.EBNF.Rep — has a problem.
This example shows that left recursion can present as a greedy
problem.
Recursive descent
If a grammar is LL(1) recursive descent will unambiguously build syntax trees based on the next input symbol.
We start with the non-terminal $accept
and the first input symbol.
Note that the terminal $eof
will follow all input symbols and that it only appears at the end of the very first rule.
If the current input symbol is in the
first
set of the current non-terminal it must be in thefirst
set of exactly one of its rules. If not, the input is not a sentence. Consider each symbol in the right-hand side of that rule.If this symbol is a terminal symbol it must be the current input symbol or the input is not a sentence; this completes a leaf node of the syntax tree. If it is
$eof
we have found a sentence and we are done; this completes the branch node for$accept
. Otherwise move on to the next input symbol and the next symbol in the rule.If the current symbol is a non-terminal symbol the current input symbol must be in its
first
set. If it is, recursively consider the new non-terminal, beginning at step 1. If it is not and the non-terminal is notempty
the input is not a sentence.If the non-terminal is
empty
, or if the recursion to the non-terminal returns, move on to the next symbol in the rule. Complete a branch node when the last symbol on the right-hand side of the current rule has been processed; then return from recursion.
Note that LL(1) guarantees that there is no ambiguity:
The first
sets are disjoint; therefore, only one rule can be picked in step 1.
For an empty
non-terminal, the first
and follow
sets are disjoint;
therefore, we cannot voluntarily skip step 3 in favor of step 4.
Dealing with ambiguity
If the relevant first
sets are not pairwise disjoint we have no reasonable fix for step 1 above.
Therefore, we count this grammar issue as ambiguous
.
Note that, in particular, a left recursive grammar can have this problem:
addition: Number;
addition: bang more;
bang: '!';
more: addition '+' Number;
This grammar describes lists of numbers separated by +
which must be prefixed by as many exclamation points
as there are +
operators, e.g., ! ! 1 + 2 + 3
.
The grammar is left-recursive (and ambiguos
) as soon as bang
can be empty.
If the relevant first
and follow
sets are not disjoint,
and if the current input symbol appears in both,
we have a choice between step 3 and step 4 above.
We can build a taller tree if we choose step 3 because we build another branch node for a new non-terminal
rather than moving on in the current rule (and branch node).
Therefore, we count this grammar issue as greedy
.
Here is a classic example, the dangling else problem:
%token Number '[0-9]+';
statement: Number;
statement: 'IF' Number 'ELSE' statement else;
else: 'ELSE' statement;
else: ;
check shows
else has overlapping first and follow sets
[done]
The non-terminal else
starts with ELSE
but can also be followed by ELSE
.
Programming languages often allow this and state that ELSE
is silently considered
part of the innermost IF
statement.
In our algorithm above this means preferring step 3 over step 4, i.e., acting greedy by accepting as much input as possible before moving on in an outer rule.
Generally speaking, recursive descent should only be based on LL(1) grammars.
ambiguous
issues should be rejected because they could indicate left recursion.
greedy
issues can often be accepted if they are more convenient to describe a language.