Terminology
As a first definition, a sentence is a sequence of terminal symbols, i.e., literals and tokens.
A language is a — possibly infinite — set of sentences. The sentences have finite but usually unbounded length.
A grammar generates syntax trees: The start symbol is the root of every syntax tree. Branch nodes are labeled with non-terminals; the non-terminal and the ordered sequence of descendants of the branch node must be a rule in the grammar. A leaf node is labeled with a terminal, or with a non-terminal which has a rule with an empty right-hand side; technically this is a branch node without a descendant.
Finally, as a second definition, a sentence is the ordered sequence of terminals which label the leaf nodes of a syntax tree. For a given grammar, every syntax tree produces a sentence and every sentence must have at least one syntax tree.
A grammar is a finite generator for a language because it has finite sets of terminals, non-terminals, and rules; each rule must have a finite length, too. There can be more than one grammar for the same language, but all grammars and the language share the same set of terminals.
Example
For example, consider the grammar
%token Number '[0-9]+';
input: ;
input: addition;
addition: Number;
addition: Number '+' addition;
Here are syntax trees for sentences with zero to three Number
elements:
All root nodes are labeled with the start symbol input
.
All leaf nodes are labeled with terminal symbols,
except for the root node of the first tree — input
has an empty right-hand side.
All other branch nodes are labeled with addition
and will either have a single Number
as a descendant, i.e., a leaf node,
or they will have three descendants, namely the leaf nodes Number
and +
and another branch node addition
.
Therefore, the grammar generates the language of sequences of Number
separated by +
, including an empty sentence.
Note that there is only one syntax tree for each sentence:
addition
with a single leaf node can only appear at the bottom right of the tree,
all other branch nodes can only be addition
with two leaf nodes and another addition
.
Ambiguous grammars
As another example, consider the slightly modified grammar
%token Number '[0-9]+';
input: ;
input: addition;
addition: Number;
addition: Number '+' addition; # rule 4
addition: addition '+' Number; # rule 5, newly added
addition: addition '+' addition; # rule 6, newly added
addition: Number '+' Number; # rule 7, newly added
Root nodes are labeled input
as before.
Branch nodes will be labeled with addition
and will either have a single leaf node Number
as a descendant,
or they will have three descendants, where the leaf node +
is between two nodes
which are either Number
or more branch nodes.
Therefore, this grammar generates the same language of sequences of Number
separated by +
, including an empty sentence.
Here are a few syntax trees for sentences with two to four Number
elements:
The grammar above is called ambiguous because there is at least one sentence for which more than one syntax tree exists. For this particular grammar there are very many such sentences, but in general, it can be difficult for a given grammar to find such a sentence, or to prove that none exists.
What's with ambiguity?
The point of a grammar is mainly that it is a finite description of a language which can be an infinite set. We cannot search such a set to prove that a given sequence of terminal symbols actually is a sentence, but computing a syntax tree using a grammar is usually quite simple.
The two examples illustrate that there can be different grammars generating the same language.
It is often difficult to find a non-ambiguous grammar for the same set of sentences.
In this particular case it is easy:
addition
above needs only one of rules 4 through 6 which involve +
to generate the same language
and rule 7 is just a special case and as such superfluous.
Choosing either rule 4 or rule 5 makes the grammar non-ambiguous,
as explained above.
If only rule 6 is included, however, the grammar is still ambiguous:
Ambiguity of a grammar is only an issue
if the meaning of a sentence is defined in a way that involves a syntax tree.
For example, if an addition
is supposed to be executed as a left-to-right post-order traversal
of the syntax tree (where branch nodes are visited after leaf nodes),
rounding and integer addition using these two syntax trees can produce different results for the following input:
1 + 1.6 + 1.6 + 1
The left tree computes 2.6 + 2.6
whereas the right tree computes 1 + 3.2 + 1
.
With rounding this amounts to 3 + 3
, i.e., 6
, and 1 + 3 + 1
, i.e., 5
!
Initial grammar checks
BNF itself has hardly any restrictions on the rules of a grammar. E.g., there can be two rules with empty right-hand sides for the same non-terminal, or two identical rules — this does no affect the syntax trees at all. However, when a grammar construction is concluded with page.BNF#init, the following are checked:
All names must be uniquely defined, either as token names using
%token
, or as non-terminal names which must appear on the left-hand side of one or more rules.%token duplicate 'dup'; %token duplicate 'dup'; duplicate: none; error: (2) duplicate is already defined error: (3) duplicate is already defined as a token error: none is not defined by a rule
This is checked in the factory methods when new tokens and new non-terminals are created, and a loop over the non-terminals eventually makes sure that they all have rules.
All non-terminals should be necessary, i.e., they must be reachable through a chain of right-hand sides from the internal start symbol
$accept
.start: needed; needed: '!'; not-needed: '?'; error: not-needed cannot be reached from rule zero
First, beginning with rule 0, rules and their non-terminals are marked by recursively tracing all unmarked rules for all non-terminals on their right-hand side. This iteration expands over a finite amount of data. Finally, a loop over the non-terminals makes sure that they all can be reached.
If a grammar is used to generate syntax trees non-terminals must have a way to produce terminals.
a: 'ok'; a: b; b: c; c: b; error: b is infinitely recursive error: c is infinitely recursive
First, repeatedly for each unmarked rule each symbol on the right-hand side is checked: if there is a terminal or a marked non-terminal, the rule and the non-terminal on the left-hand side are marked. The iteration expands over a finite amount of data and is halted if there is no change. Finally, a loop over the non-terminals makes sure that they have all been marked, i.e., are able to produce at least one terminal.