Tracing
A trace of parser execution was useful, e.g., to judge the quality of error recovery:
error: (5) ']' is not allowed in "'(' argument */ ',' ')'"
expecting: ')'
4 (5) ']' reduce parentheses: '(' argum obj one,2
4 (5) ']' shift ']' obj null
4 (5) '.' shift '.' obj null
3 (6) '}' reduce brackets: '[' parenthe obj one,2
Tracing is implemented using the Observer design pattern: A subject is informed about an observer and sends back messages as the state of the subject changes. The page.LL1 object plays the role of subject which calls an Observer function. The initial connection is either made when the parser is constructed or when one of the parsing methods is called.
For top-down parsing with either, page.LL1#parse or page.LL1#parser, there are four messages describing state changes:
rule
indicates that the parser starts to match a new rule.shift
indicates that a terminal symbol has been matched.reduce
indicates that the parser is done matching a rule.error
, finally, indicates that the parser could not match the current input symbol.
Each message is accompanied by additional information:
a number indicating a position in the grammar which could be a rule number,
the current input page.Tuple,
a page.BNF.Rule or page.BNF.T object as context from the grammar, or an error message string,
and a reference to the subject sending the message which could be queried for other information.
Tracing is most convenient if it can be added quickly at any time. Therefore, page has a factory method page.trace to create observers which will simply print the messages.
For example, the stack algorithm for the error recovery example can be traced as follows:
ll1 = new page.LL1(page.EBNF.from( Grammar.value ), page.trace());
input = ll1.tokenizer()( Input.value );
ll1.parse(input.concat([null]));
If the trace of the recursive algorithm needs a new header, a fresh observer can be provided when the algorithm is invoked:
ll1.parser(input, page.trace(/reduce/));
The trace can be restricted to messages selected by a pattern argument to page.trace.
Collecting values
So far, input symbols are abandoned as soon as they are matched during parsing.
An observer can change that because it sees all matched tuples during shift
messages:
collect = [];
function observer (_at, _tuple, _action, _info, _sender) {
if (_action == 'shift') collect.push(_tuple.value);
return collect;
}
For example, the following EBNF grammar describes a sum as a sequence of numbers separated by plus or minus signs:
%token Number '[0-9]+';
sum: Number +/ ('+' | '-');
The observer can be used to collect all input symbols:
ll1 = new page.LL1(page.EBNF.from( Grammar.value ), observer);
input = ll1.tokenizer()( '1 + 3 - 5 + 7' );
ll1.parser(input).join(' . ')
By convention, the parsing algorithm returns the last value which it received from the observer. Here this is
1 . + . 3 . - . 5 . + . 7 .
These are the input symbols (pushed into the array collect
above), joined with blanks and periods for display purposes.
The trailing period is caused by matching and collecting $eof
.
page.trace accepts an observer as a first argument and observes its result:
ll1.parser(input, page.trace(observer))
produces
AT TUPLE ACTION RETURN
0 (2) Number '1' rule $accept: sum $eof; [ ]
1 (2) Number '1' rule sum: Number +/ ('+' | ' [ ]
1 (2) Number '1' shift Number [ ] 1
1 (2) '+' shift '+' [ ] 1,+
1 (2) Number '3' shift Number [ ] 1,+,3
1 (2) '-' shift '-' [ ] 1,+,3,-
1 (2) Number '5' shift Number [ ] 1,+,3,-,5
1 (2) '+' shift '+' [ ] 1,+,3,-,5,+
1 (2) Number '7' shift Number [ ] 1,+,3,-,5,+,7
1 (0) $eof reduce sum: Number +/ ('+' | ' [ ] 1,+,3,-,5,+,7
1 (0) $eof shift $eof [ ] 1,+,3,-,5,+,7,
0 (0) $eof reduce $accept: sum $eof; [ ] 1,+,3,-,5,+,7,
1,+,3,-,5,+,7,
The RETURN
column shows how collect
acquires one matched input symbol after another.
Collecting values turns out to be so useful for further processing that it is built into page as well. page.LL1#reflect returns an observer which collects those input strings which were matched by page.BNF.Token patterns:
ll1.parser(input, ll1.reflect());
produces a flat array
1,3,5,7
There is a significant difference between the two built-in observers: page.trace is a static function and produces a state-less observer, page.LL1#reflect is a method of a parser object because the observer has state — the collected values. Either observer is serially (but not concurrently) reusable.
Semantic actions
What can be done with the collected values?
The sum
example hints that one could try to build a calculator.
However, by design choice page.LL1#reflect does not collect literal operators.
Here is a trace for a slightly different grammar:
%token Number '[0-9]+';
sum: Number (add | sub)*;
add: '+' Number;
sub: '-' Number;
The reduce
messages from parsing '1 + 3 - 5 + 7'
AT TUPLE ACTION RETURN
2 (2) '-' reduce add: '+' Number; [ ] 3
3 (2) '+' reduce sub: '-' Number; [ ] 5
2 (0) $eof reduce add: '+' Number; [ ] 7
1 (0) $eof reduce sum: Number (add | sub) [ ] 1,3,5,7
0 (0) $eof reduce $accept: sum $eof; [ ] 1,3,5,7
show how each but the first Number
is collected in a reduce
message for a rule which is specific to the desired arithmetic operation.
The first Number
is collected in rule 1, near the end of the trace.
Note that the value collected for a token is a string from the input.
To avoid a lot of explicit conversions, the grammar is changed so that Number
appears in only one rule:
sum: num (add | sub)*;
add: '+' num;
sub: '-' num;
num: Number;
Given a collection of functions as an argument, page.LL1#reflect can call a Builder function corresponding to the name of the non-terminal on the left-hand side:
actions = {
num: function (values) { return Number(values[0]); },
add: function (values, env) { return env[0] += values[0]; },
sub: function (values, env) { return env[0] -= values[0]; },
sum: function (values, env) { return values[0] + env[0]; }
};
If a Builder function with the name of a rules left-hand side can be found, it receives
the list of values collected for the rule,
an environment object originally passed to
reflect
,and a reference to the parser which can be queried for the current input page.Tuple.
The function can perform arbitrary actions. It is known as a semantic action because it defines the meaning of the grammar rule.
The function result
is collected by reflect
into the list for the encompassing rule; an array result is flattened.
The function num
shown above converts the input string representing Number
into a numerical value
and returns that to reflect
for the outer rules. The grammar rule for num
above was added
exactly to place the conversion into a single Builder function.
add
and sub
assume that env
is an array, initialized with a single element of value zero,
and accumulate the running total in that array element.
Finally, sum
combines the value of the first Number
with the running total in the environment
and returns the result.
The environment is initially handed to reflect
, together with the Builder function collection:
ll1.parser(input, page.trace( ll1.reflect([0], actions), /reduce/));
The reduce
messages from the trace show how the numbers are collected until the total 6
is computed.
AT TUPLE ACTION RETURN
4 (2) '+' reduce num: Number; num 1
4 (2) '-' reduce num: Number; num 3
2 (2) '-' reduce add: '+' num; num 3
4 (2) '+' reduce num: Number; num 5
3 (2) '+' reduce sub: '-' num; num -2
4 (0) $eof reduce num: Number; num 7
2 (0) $eof reduce add: '+' num; num 5
1 (0) $eof reduce sum: num (add | sub)*; num 6
0 (0) $eof reduce $accept: sum $eof; [ ] 6
6
The trace shows that all the return values passed through reflect
are numerical values.
However, there is no semantic action for rule 0; therefore, the final value returned from reflect
is an array containing the last numerical value.
This example should suffice to explain why by design choice reflect
does not collect literal input strings:
they are already known in the grammar. Only strings representing token patterns are interesting for further processing.
Experience shows that the grammar can always be (re-)written so that a Builder function knows exactly
what literals and how many token values it deals with.
Similarly, while a Builder function can return an array with zero or more values to reflect
,
and while these values are flattened into the list for the next outer rule,
the next outer Builder function knows from its corresponding grammar rule that it receives values for a specific non-terminal
or EBNF repetition construct and should be able to deal even with a variable number of such values.
As an example, consider the fact that, technically, arithmetic is not quite performed from left to right with the Builder functions above. The following semantic actions eliminate this issue:
actions = {
num: function (values) { return Number(values[0]); },
sub: function (values) { return - values[0]; },
sum: function (values) {
return values.slice(1).reduce(
function (left, right) { return left + right; },
values[0]
);
}
};
print( ll1.parser(input, page.trace(ll1.reflect(null, actions), /reduce/)) )
This time the environment is not needed; therefore, null
is sent to reflect
initially.
The trace shows that the add
rule relies on reflect
to hand the value to the outer rule
and that the sub
rule merely changes the sign of the numerical value.
AT TUPLE ACTION RETURN
4 (2) '+' reduce num: Number; num 1
4 (2) '-' reduce num: Number; num 3
2 (2) '-' reduce add: '+' num; [ ] 3
4 (2) '+' reduce num: Number; num 5
3 (2) '+' reduce sub: '-' num; num -5
4 (0) $eof reduce num: Number; num 7
2 (0) $eof reduce add: '+' num; [ ] 7
1 (0) $eof reduce sum: num (add | sub)*; num 6
0 (0) $eof reduce $accept: sum $eof; [ ] 6
6
The actual computation takes place in the sum
function which receives the first number
and any number of additional summands.
While right recursion is used to implement the iteration
in the grammar rule for sum
, arithmetic is still performed from left to right,
courtesy of the JavaScript reduce
method for Array
objects.
Interpreting arithmetic expressions
The previous example evaluates strictly from left to right which is sufficient for addition mixed with subtraction. Full-scale arithmetic, however, requires operator precedence and parentheses. The grammar has to be extended:
%token Number '[0-9]+';
expr: prod (add | sub)*;
add: '+' prod;
sub: '-' prod;
prod: factor (mul | div)*;
mul: '*' factor;
div: '/' factor;
factor: prim power*;
power: '**' prim;
prim: num | '(' expr ')';
num: Number;
Commonly, exponentiation is evaluated from right to left and takes precedence over the other binary operations which are evaluated from left to right. Multiplication and division take precedence over addition and subtraction. Before the addition in
2 + 3 * 4
can happen, the rule for prod
has to be reduced
and the observer created by page.LL1#reflect will execute the corresponding function
which can perform the multiplication prior to the addition.
Here are the necessary Builder functions for a full-scale interpreter:
actions = {
expr: function (values) {
return values.slice(1).reduce(
function (left, right) { return left + right; }, values[0]);
},
sub: function (values) { return - values[0]; },
prod: function (values) {
return values.slice(1).reduce(
function (left, right) { return left * right; }, values[0]);
},
div: function (values) { return 1 / values[0]; },
factor: function (values) {
return values.reduceRight(
function (right, left) { return Math.pow(left, right); }, values.pop());
},
num: function (values) { return Number(values[0]); }
};
The functions for expr
and prod
perform left-to-right evaluation.
It is debatable if creating an inverse in the function for div
for later multiplication in prod
is numerically very sound, but at least JavaScript performs floating point arithmetic
and will not crash if a division by zero is attempted.
Alternatively, the observer could throw an error message to abort interpretation.
Right associativity is implemented using the JavaScript reduceRight
method
for Array
objects
which, however, presents the operands in reverse order for exponentiation with Math.pow()
.
The trace shows how the computation of
1 + 2*3 - 45/(1 + 2 ** 3)
progresses:
AT TUPLE ACTION RETURN
10 (2) '+' reduce num: Number; num 1
9 (2) '+' reduce prim: num | '(' expr ') [ ] 1
7 (2) '+' reduce factor: prim power*; num 1
4 (2) '+' reduce prod: factor (mul | div num 1
10 (2) '*' reduce num: Number; num 2
9 (2) '*' reduce prim: num | '(' expr ') [ ] 2
7 (2) '*' reduce factor: prim power*; num 2
10 (2) '-' reduce num: Number; num 3
9 (2) '-' reduce prim: num | '(' expr ') [ ] 3
7 (2) '-' reduce factor: prim power*; num 3
5 (2) '-' reduce mul: '*' factor; [ ] 3
4 (2) '-' reduce prod: factor (mul | div num 6
2 (2) '-' reduce add: '+' prod; [ ] 6
10 (2) '/' reduce num: Number; num 45
9 (2) '/' reduce prim: num | '(' expr ') [ ] 45
7 (2) '/' reduce factor: prim power*; num 45
10 (2) '+' reduce num: Number; num 1
9 (2) '+' reduce prim: num | '(' expr ') [ ] 1
7 (2) '+' reduce factor: prim power*; num 1
4 (2) '+' reduce prod: factor (mul | div num 1
10 (2) '**' reduce num: Number; num 2
9 (2) '**' reduce prim: num | '(' expr ') [ ] 2
10 (2) ')' reduce num: Number; num 3
9 (2) ')' reduce prim: num | '(' expr ') [ ] 3
8 (2) ')' reduce power: '**' prim; [ ] 3
7 (2) ')' reduce factor: prim power*; num 8
4 (2) ')' reduce prod: factor (mul | div num 8
2 (2) ')' reduce add: '+' prod; [ ] 8
1 (2) ')' reduce expr: prod (add | sub)* num 9
9 (0) $eof reduce prim: num | '(' expr ') [ ] 9
7 (0) $eof reduce factor: prim power*; num 9
6 (0) $eof reduce div: '/' factor; num 0.1111111111111111
4 (0) $eof reduce prod: factor (mul | div num 5
3 (0) $eof reduce sub: '-' prod; num -5
1 (0) $eof reduce expr: prod (add | sub)* num 2
0 (0) $eof reduce $accept: expr $eof; [ ] 2
2
The trace shows that rule 4 — the product of 2
and 3
— is reduced long before rule 1 — the overall sum — which has lowest precedence
and is reduced just before rule 0 accepts the expression.
The effect of the parentheses and precedence of **
over +
can be seen by the order in which
rule 7 — 2
to the third power, rule 1 — the sum of 1
and 2 ** 3
, and rule 4 — the quotient — are reduced.
It should be noted that the same set of Builder functions always works for both, the recursive page.LL1#parser algorithm and the stack-based page.LL1#parse algorithm. The latter is strictly based on a BNF grammar, i.e., value collection works equally well for grammars described using BNF or extended BNF.
Representing arithmetic expressions
An arithmetic expression is often represented as a tree with the operations in the branch nodes and the numbers in the leaves. Lisp's prefix notation is a good way to display such a tree. The following Builder functions create Lisp notation from an arithmetic expression which is a sentence of the grammar shown before.
actions = {
expr: fromLeft,
add: function (values) { return { op: '+', val: values[0] }; },
sub: function (values) { return { op: '-', val: values[0] }; },
prod: fromLeft,
mul: function (values) { return { op: '*', val: values[0] }; },
div: function (values) { return { op: '/', val: values[0] }; },
factor: fromRight,
num: function (values) { return Number(values[0]); }
};
As before, num
converts all matched Number
tokens into JavaScript number
values.
The semantic actions for the operations add
, sub
, mul
, and div
capture the
corresponding Lisp operator and its right argument in a collection.
The semantic actions at the two precedence levels of arithmetic, prod
and expr
share a function fromLeft
which creates Lisp notation to combine all operations
at the same precedence level, associating from left to right:
function fromLeft (values) {
return values.slice(1).reduce(
function (left, right) {
return '(' + right.op + ' ' + left + ' ' + right.val + ')';
}, values[0]);
}
fromLeft
uses JavaScripts reduce
method for Array
objects
to create one branch node at a time,
proceeding from left to right. The first collected value was not processed by one of the operators
but all others are collections. Therefore, fromLeft
combines the first value and the value from the collection
right
with the operator from the right
.
function fromRight (values) {
return values.reduceRight(
function (right, left) {
return '(expt ' + left + ' ' + right + ')';
}, values.pop());
}
The semantic action for exponentiation uses a function from Right
which is based on JavaScripts
reduceRight
method for Array
objects
to create branch nodes proceeding from right to left.
from Right
is much simpler and the default action — value collection — is sufficient for power
because there is only one operation at this precedence level.
Otherwise, if power
were encoded like the other operations
power: function (values) { return { op: 'expt', val: values[0] }; },
In this case fromRight
has to treat the leftmost value as a special case and encode the intermediate terms
function fromRight (values) {
return values.reduceRight(
function (right, left) {
return typeof left == 'number' ?
'(' + right.op + ' ' + left + ' ' + right.val + ')' :
{ op: left.op, val: '(' + right.op + ' ' + left.val + ' ' + right .val + ')' };
}, values.pop());
}
This would be the general blueprint for right-associativity.
The formula 1 + 2*3 - 45/(1 + 2 ** 3)
is represented as Lisp expression
(- (+ 1 (* 2 3)) (/ 45 (+ 1 (expt 2 3))))
Note how multiplication and exponentiation as well as parentheses do take precedence and that addition and subtraction associate to the left.
As for right-associativity, (2 ** 3 ** 4 ** 5)
is represented as
(expt 2 (expt 3 (expt 4 5)))
Compiling arithmetic expressions
JavaScript has functions as first-order values, i.e., they can be passed as arguments and returned as values. It is tempting to build an arithmetic expression into a nest of functions:
actions = {
expr: fromLeft,
add: function (values) { return function (left) { return left() + values[0](); } },
sub: function (values) { return function (left) { return left() - values[0](); } },
prod: fromLeft,
mul: function (values) { return function (left) { return left() * values[0](); } },
div: function (values) { return function (left) { return left() / values[0](); } },
factor: fromRight,
num: function (values) { return function () { return Number(values[0]); }; }
};
num
turns a numerical value from a string into a parameter-less function returning the value.
An operator such as add
expects the collected value to be a parameter-less function returning the right-hand operand value.
The semantic actions returns a function which expects a parameter-less function returning the left-hand operand value,
evaluates to get both operand values, combines the values and returns the value.
combines it with the right-hand value, and returns the result.
function fromLeft (values) {
return values.slice(1).reduce(
function (left, right) {
return function () { return right(left); };
}, values[0]);
}
fromLeft
creates a left-associative tower of parameter-less function calls.
Exponentiation is again handled as a special case of right-associativity:
function fromRight (values) { // special case: Math.pow only
return values.reduceRight(
function (right, left) {
return function () { return Math.pow(left(), right()); };
}, values.pop());
}
The result of parsing 1 + 2*3 - 45/(1 + 2 ** 3)
looks somewhat disappointing:
function () { return right(left); }
On the other hand, the trace shows that the result is an array containing a function which can be unwrapped and executed:
ll1.parser(input, ll1.reflect(null, actions))[0]()
This does return the expected value 2
. Note that the result of a reflect
observer will be an array
containing whatever values were collected for rule 0, i.e., for the start symbol of the grammar. The code shows
how to extract the function from the array and call it.
Adding memory
Here is a bit of a twist:
num: function (_, _tuple, _values) { return function () { return memory[Number(_values[0])]; }; }
This way num
turns a Number
value into a reference to the array memory
which at this point does not even exist.
Parsing
7 + 6*5 - 4/(3 + 2 ** 1)
produces the result
function () { return item(total); }
But if memory
is set and the result of parsing unwrapped and executed
expr = ll1.parser(input, ll1.reflect(null, actions))[0];
memory = [0, 2, 4, 7, 69, 5, 4, 3]
print( memory.join(' ') )
print( expr() )
delete memory
memory = [0, 3, 2, 1, 45, 3, 2, 1]
print( memory.join(' ') )
print( expr() )
we get
0 2 4 7 69 5 4 3
20
0 3 2 1 45 3 2 1
2
i.e., the result expr
is a nest of functions representing the parsed sequence of operations
which can be executed
on whatever values are stored into memory
— even if the name is destroyed and a new binding is created.
A little language
Programming languages usually combine arithmetic expressions with the ability to control the flow of execution. Euclid's algorithm to compute the greatest common divisor of two natural numbers exercises both aspects:
while (x != y)
if (x > y) x = x - y;
else y = y - x;
The notation is borrowed from JavaScript and many other languages.
This input is a sentence for a grammar which combines the grammar used for most of this chapter
with control flow and the use of variable names.
The following needs to be inserted between the definitions for Number
and expr
:
%token Name '[a-z_A-Z][a-z_A-Z0-9]*';
program: statement*;
statement: assign | if | while | block;
assign: Name '=' expression ';';
if: 'if' '(' expression ')' statement else?;
else: 'else' statement;
while: 'while' '(' expression ')' statement;
block: '{' statement* '}';
expression: conj or*;
or: '||' conj;
conj: cmp and*;
and: '&&' cmp;
cmp: expr ((lt | le | gt | ge | eq | ne) expr)?;
lt: '<';
le: '<=';
gt: '>';
ge: '>=';
eq: '==';
ne: '!=';
prim
needs to include a variable Name
in addition to a Number
:
prim: num | name | '(' expression ')';
num: Number;
name: Name;
prim
allows a full expression
so that all operations can use parentheses.
The semantic actions discussed at the end of the last section
implement expr
as parameter-less functions. They have to be augmented
to handle control flow, conditions, and variable names.
function sequence (values) {
return values.slice(1).reduce(
function (left, right) {
return function () { left(); return right(); };
}, values[0]);
}
actions = {
program: sequence,
assign: function (values) {
return function () { memory[values[0]] = values[1](); };
},
The result of most Builder functions are parameter-less functions.
sequence
uses reduce
to build a tower of function calls which calls such functions
in order from left to right.
One such function could be an assignment:
assign
expects the string value for a Name
, i.e, a variable name,
and a parameter-less function producing the value to be assigned, and returns a parameter-less function
which will evaluate and assign to some memory
location identified by name.
The other statements are similarly mapped to JavaScript:
'if': function (values) {
if (values.length == 2)
return function () { if (values[0]()) values[1](); };
else
return function () { if (values[0]()) values[1](); else values[2](); };
},
'while': function (values) {
return function () { while (values[0]()) values[1](); };
},
block: sequence,
The arithmetic expressions have to be augmented with comparisons and logical operations but they all follow the familiar pattern:
expression: fromLeft,
or: function (values) { return function (left) { return left() || values[0](); } },
conj: fromLeft,
and: function (values) { return function (left) { return left() && values[0](); } },
cmp: function (values) {
if (values.length == 1) return values[0];
return function() { return values[1]( values[0], values[2]); };
},
lt: function (values) { return function (left, right) { return left() < right(); } },
le: function (values) { return function (left, right) { return left() <= right(); } },
gt: function (values) { return function (left, right) { return left() > right(); } },
ge: function (values) { return function (left, right) { return left() >= right(); } },
eq: function (values) { return function (left, right) { return left() == right(); } },
ne: function (values) { return function (left, right) { return left() != right(); } },
JavaScript permits chaining comparison operators; however, a < b < c
does not bracket b
between a
and c
.
Therefore, the grammar of the little language discussed here does not permit such chaining,
at least not without using parentheses. The grammar forces comparison to be non-associative.
prim
added name
to produce the value associated with a variable name and
num
, of course, needs to produce the value of a Number
token:
num: function (values) { return function () { return Number(values[0]); }; },
name: function (values) { return function () { return memory[values[0]]; }; }
memory
now is a collection of values, selected by variable names. Here is how Euclid's algorithm would
be compiled and executed:
ll1 = new page.LL1(page.EBNF.from( Grammar.value )); // the grammar
input = ll1.tokenizer()( Input.value ); // the program
expr = ll1.parser(input, ll1.reflect(null, actions))[0]; // the executable code
memory = { x: 36, y: 54 } // initialize x and y
expr() // run the executable code
print( memory.x + ' ' + memory.y )
This prints
18 18
That's all, folks!