Syntax trees

Terminology

As a first definition, a sentence is a sequence of terminal symbols, i.e., literals and tokens.

A language is a — possibly infinite — set of sentences. The sentences have finite but usually unbounded length.

A grammar generates syntax trees: The start symbol is the root of every syntax tree. Branch nodes are labeled with non-terminals; the non-terminal and the ordered sequence of descendants of the branch node must be a rule in the grammar. A leaf node is labeled with a terminal, or with a non-terminal which has a rule with an empty right-hand side; technically this is a branch node without a descendant.

Finally, as a second definition, a sentence is the ordered sequence of terminals which label the leaf nodes of a syntax tree. For a given grammar, every syntax tree produces a sentence and every sentence must have at least one syntax tree.

A grammar is a finite generator for a language because it has finite sets of terminals, non-terminals, and rules; each rule must have a finite length, too. There can be more than one grammar for the same language, but all grammars and the language share the same set of terminals.

Example

For example, consider the grammar

%token Number '[0-9]+';

input: ;
input: addition;
addition: Number;
addition: Number '+' addition;

Here are syntax trees for sentences with zero to three Number elements:

syntax trees

All root nodes are labeled with the start symbol input.

All leaf nodes are labeled with terminal symbols, except for the root node of the first tree — input has an empty right-hand side.

All other branch nodes are labeled with addition and will either have a single Number as a descendant, i.e., a leaf node, or they will have three descendants, namely the leaf nodes Number and + and another branch node addition.

Therefore, the grammar generates the language of sequences of Number separated by +, including an empty sentence.

Note that there is only one syntax tree for each sentence: addition with a single leaf node can only appear at the bottom right of the tree, all other branch nodes can only be addition with two leaf nodes and another addition.

Ambiguous grammars

As another example, consider the slightly modified grammar

%token Number '[0-9]+';

input: ;
input: addition;
addition: Number;
addition: Number '+' addition;      # rule 4
addition: addition '+' Number;      # rule 5, newly added
addition: addition '+' addition;    # rule 6, newly added
addition: Number '+' Number;        # rule 7, newly added

Root nodes are labeled input as before. Branch nodes will be labeled with addition and will either have a single leaf node Number as a descendant, or they will have three descendants, where the leaf node + is between two nodes which are either Number or more branch nodes. Therefore, this grammar generates the same language of sequences of Number separated by +, including an empty sentence.

Here are a few syntax trees for sentences with two to four Number elements:

syntax trees

syntax trees

syntax trees

The grammar above is called ambiguous because there is at least one sentence for which more than one syntax tree exists. For this particular grammar there are very many such sentences, but in general, it can be difficult for a given grammar to find such a sentence, or to prove that none exists.

What's with ambiguity?

The point of a grammar is mainly that it is a finite description of a language which can be an infinite set. We cannot search such a set to prove that a given sequence of terminal symbols actually is a sentence, but computing a syntax tree using a grammar is usually quite simple.

The two examples illustrate that there can be different grammars generating the same language. It is often difficult to find a non-ambiguous grammar for the same set of sentences. In this particular case it is easy: addition above needs only one of rules 4 through 6 which involve + to generate the same language and rule 7 is just a special case and as such superfluous. Choosing either rule 4 or rule 5 makes the grammar non-ambiguous, as explained above.

If only rule 6 is included, however, the grammar is still ambiguous:

syntax trees

Ambiguity of a grammar is only an issue if the meaning of a sentence is defined in a way that involves a syntax tree. For example, if an addition is supposed to be executed as a left-to-right post-order traversal of the syntax tree (where branch nodes are visited after leaf nodes), rounding and integer addition using these two syntax trees can produce different results for the following input:

1 + 1.6 + 1.6 + 1

The left tree computes 2.6 + 2.6 whereas the right tree computes 1 + 3.2 + 1. With rounding this amounts to 3 + 3, i.e., 6, and 1 + 3 + 1, i.e., 5!

Initial grammar checks

BNF itself has hardly any restrictions on the rules of a grammar. E.g., there can be two rules with empty right-hand sides for the same non-terminal, or two identical rules — this does no affect the syntax trees at all. However, when a grammar construction is concluded with page.BNF#init, the following are checked:

  • All names must be uniquely defined, either as token names using %token, or as non-terminal names which must appear on the left-hand side of one or more rules.

    %token duplicate 'dup';
    %token duplicate 'dup';
    duplicate: none;
    
    error: (2) duplicate is already defined
    error: (3) duplicate is already defined as a token
    error: none is not defined by a rule

    This is checked in the factory methods when new tokens and new non-terminals are created, and a loop over the non-terminals eventually makes sure that they all have rules.

  • All non-terminals should be necessary, i.e., they must be reachable through a chain of right-hand sides from the internal start symbol $accept.

    start: needed;
    needed: '!';
    not-needed: '?';
    
    error: not-needed cannot be reached from rule zero

    First, beginning with rule 0, rules and their non-terminals are marked by recursively tracing all unmarked rules for all non-terminals on their right-hand side. This iteration expands over a finite amount of data. Finally, a loop over the non-terminals makes sure that they all can be reached.

  • If a grammar is used to generate syntax trees non-terminals must have a way to produce terminals.

    a: 'ok';
    a: b;
    b: c;
    c: b;
    
    error: b is infinitely recursive
    error: c is infinitely recursive

    First, repeatedly for each unmarked rule each symbol on the right-hand side is checked: if there is a terminal or a marked non-terminal, the rule and the non-terminal on the left-hand side are marked. The iteration expands over a finite amount of data and is halted if there is no change. Finally, a loop over the non-terminals makes sure that they have all been marked, i.e., are able to produce at least one terminal.

Next: Extended BNF Previous: Dissecting input