Observers

Tracing

A trace of parser execution was useful, e.g., to judge the quality of error recovery:

error: (5) ']' is not allowed in "'(' argument */ ',' ')'"
expecting: ')'
4   (5) ']'                 reduce  parentheses: '(' argum  obj one,2
4   (5) ']'                 shift   ']'                     obj null
4   (5) '.'                 shift   '.'                     obj null
3   (6) '}'                 reduce  brackets: '[' parenthe  obj one,2

Tracing is implemented using the Observer design pattern: A subject is informed about an observer and sends back messages as the state of the subject changes. The page.LL1 object plays the role of subject which calls an Observer function. The initial connection is either made when the parser is constructed or when one of the parsing methods is called.

For top-down parsing with either, page.LL1#parse or page.LL1#parser, there are four messages describing state changes:

  • rule indicates that the parser starts to match a new rule.

  • shift indicates that a terminal symbol has been matched.

  • reduce indicates that the parser is done matching a rule.

  • error, finally, indicates that the parser could not match the current input symbol.

Each message is accompanied by additional information:

  • a number indicating a position in the grammar which could be a rule number,

  • the current input page.Tuple,

  • a page.BNF.Rule or page.BNF.T object as context from the grammar, or an error message string,

  • and a reference to the subject sending the message which could be queried for other information.

Tracing is most convenient if it can be added quickly at any time. Therefore, page has a factory method page.trace to create observers which will simply print the messages.

For example, the stack algorithm for the error recovery example can be traced as follows:

ll1 = new page.LL1(page.EBNF.from( Grammar.value ), page.trace());
input = ll1.tokenizer()( Input.value );
ll1.parse(input.concat([null]));

If the trace of the recursive algorithm needs a new header, a fresh observer can be provided when the algorithm is invoked:

ll1.parser(input, page.trace(/reduce/));

The trace can be restricted to messages selected by a pattern argument to page.trace.

Collecting values

So far, input symbols are abandoned as soon as they are matched during parsing. An observer can change that because it sees all matched tuples during shift messages:

collect = [];
function observer (_at, _tuple, _action, _info, _sender) {
  if (_action == 'shift') collect.push(_tuple.value);
  return collect;
}

For example, the following EBNF grammar describes a sum as a sequence of numbers separated by plus or minus signs:

%token Number '[0-9]+';
sum: Number +/ ('+' | '-');

The observer can be used to collect all input symbols:

ll1 = new page.LL1(page.EBNF.from( Grammar.value ), observer);
input = ll1.tokenizer()( '1 + 3 - 5 + 7' );
ll1.parser(input).join(' . ')

By convention, the parsing algorithm returns the last value which it received from the observer. Here this is

1 . + . 3 . - . 5 . + . 7 . 

These are the input symbols (pushed into the array collect above), joined with blanks and periods for display purposes. The trailing period is caused by matching and collecting $eof.

page.trace accepts an observer as a first argument and observes its result:

ll1.parser(input, page.trace(observer))

produces

AT  TUPLE            ACTION                          RETURN
0   (2) Number '1'   rule   $accept: sum $eof;       [ ] 
1   (2) Number '1'   rule   sum: Number +/ ('+' | '  [ ] 
1   (2) Number '1'   shift  Number                   [ ] 1
1   (2) '+'          shift  '+'                      [ ] 1,+
1   (2) Number '3'   shift  Number                   [ ] 1,+,3
1   (2) '-'          shift  '-'                      [ ] 1,+,3,-
1   (2) Number '5'   shift  Number                   [ ] 1,+,3,-,5
1   (2) '+'          shift  '+'                      [ ] 1,+,3,-,5,+
1   (2) Number '7'   shift  Number                   [ ] 1,+,3,-,5,+,7
1   (0) $eof         reduce sum: Number +/ ('+' | '  [ ] 1,+,3,-,5,+,7
1   (0) $eof         shift  $eof                     [ ] 1,+,3,-,5,+,7,
0   (0) $eof         reduce $accept: sum $eof;       [ ] 1,+,3,-,5,+,7,
1,+,3,-,5,+,7,

The RETURN column shows how collect acquires one matched input symbol after another.

Collecting values turns out to be so useful for further processing that it is built into page as well. page.LL1#reflect returns an observer which collects those input strings which were matched by page.BNF.Token patterns:

ll1.parser(input, ll1.reflect());

produces a flat array

1,3,5,7

There is a significant difference between the two built-in observers: page.trace is a static function and produces a state-less observer, page.LL1#reflect is a method of a parser object because the observer has state — the collected values. Either observer is serially (but not concurrently) reusable.

Semantic actions

What can be done with the collected values? The sum example hints that one could try to build a calculator. However, by design choice page.LL1#reflect does not collect literal operators. Here is a trace for a slightly different grammar:

%token Number '[0-9]+';

sum: Number (add | sub)*;
add: '+' Number;
sub: '-' Number;

The reduce messages from parsing '1 + 3 - 5 + 7'

AT  TUPLE            ACTION                          RETURN
2   (2) '-'          reduce add: '+' Number;         [ ] 3
3   (2) '+'          reduce sub: '-' Number;         [ ] 5
2   (0) $eof         reduce add: '+' Number;         [ ] 7
1   (0) $eof         reduce sum: Number (add | sub)  [ ] 1,3,5,7
0   (0) $eof         reduce $accept: sum $eof;       [ ] 1,3,5,7

show how each but the first Number is collected in a reduce message for a rule which is specific to the desired arithmetic operation. The first Number is collected in rule 1, near the end of the trace.

Note that the value collected for a token is a string from the input. To avoid a lot of explicit conversions, the grammar is changed so that Number appears in only one rule:

sum: num (add | sub)*;
add: '+' num;
sub: '-' num;
num: Number;

Given a collection of functions as an argument, page.LL1#reflect can call a Builder function corresponding to the name of the non-terminal on the left-hand side:

actions = {
num: function (values) { return Number(values[0]); },
add: function (values, env) { return env[0] += values[0]; },
sub: function (values, env) { return env[0] -= values[0]; },
sum: function (values, env) { return values[0] + env[0]; }
};

If a Builder function with the name of a rules left-hand side can be found, it receives

  • the list of values collected for the rule,

  • an environment object originally passed to reflect,

  • and a reference to the parser which can be queried for the current input page.Tuple.

The function can perform arbitrary actions. It is known as a semantic action because it defines the meaning of the grammar rule.

The function result is collected by reflect into the list for the encompassing rule; an array result is flattened.

The function num shown above converts the input string representing Number into a numerical value and returns that to reflect for the outer rules. The grammar rule for num above was added exactly to place the conversion into a single Builder function.

add and sub assume that env is an array, initialized with a single element of value zero, and accumulate the running total in that array element. Finally, sum combines the value of the first Number with the running total in the environment and returns the result. The environment is initially handed to reflect, together with the Builder function collection:

ll1.parser(input, page.trace( ll1.reflect([0], actions), /reduce/));

The reduce messages from the trace show how the numbers are collected until the total 6 is computed.

AT  TUPLE            ACTION                          RETURN
4   (2) '+'          reduce num: Number;             num 1
4   (2) '-'          reduce num: Number;             num 3
2   (2) '-'          reduce add: '+' num;            num 3
4   (2) '+'          reduce num: Number;             num 5
3   (2) '+'          reduce sub: '-' num;            num -2
4   (0) $eof         reduce num: Number;             num 7
2   (0) $eof         reduce add: '+' num;            num 5
1   (0) $eof         reduce sum: num (add | sub)*;   num 6
0   (0) $eof         reduce $accept: sum $eof;       [ ] 6
6

The trace shows that all the return values passed through reflect are numerical values. However, there is no semantic action for rule 0; therefore, the final value returned from reflect is an array containing the last numerical value.

This example should suffice to explain why by design choice reflect does not collect literal input strings: they are already known in the grammar. Only strings representing token patterns are interesting for further processing. Experience shows that the grammar can always be (re-)written so that a Builder function knows exactly what literals and how many token values it deals with.

Similarly, while a Builder function can return an array with zero or more values to reflect, and while these values are flattened into the list for the next outer rule, the next outer Builder function knows from its corresponding grammar rule that it receives values for a specific non-terminal or EBNF repetition construct and should be able to deal even with a variable number of such values.

As an example, consider the fact that, technically, arithmetic is not quite performed from left to right with the Builder functions above. The following semantic actions eliminate this issue:

actions = {
num: function (values) { return Number(values[0]); },
sub: function (values) { return - values[0]; },
sum: function (values) {
    return values.slice(1).reduce(
        function (left, right) { return left + right; },
        values[0]
      ); 
  }       
};

print( ll1.parser(input, page.trace(ll1.reflect(null, actions), /reduce/)) )

This time the environment is not needed; therefore, null is sent to reflect initially. The trace shows that the add rule relies on reflect to hand the value to the outer rule and that the sub rule merely changes the sign of the numerical value.

AT  TUPLE            ACTION                          RETURN
4   (2) '+'          reduce num: Number;             num 1
4   (2) '-'          reduce num: Number;             num 3
2   (2) '-'          reduce add: '+' num;            [ ] 3
4   (2) '+'          reduce num: Number;             num 5
3   (2) '+'          reduce sub: '-' num;            num -5
4   (0) $eof         reduce num: Number;             num 7
2   (0) $eof         reduce add: '+' num;            [ ] 7
1   (0) $eof         reduce sum: num (add | sub)*;   num 6
0   (0) $eof         reduce $accept: sum $eof;       [ ] 6
6

The actual computation takes place in the sum function which receives the first number and any number of additional summands. While right recursion is used to implement the iteration in the grammar rule for sum, arithmetic is still performed from left to right, courtesy of the JavaScript reduce method for Array objects.

Interpreting arithmetic expressions

The previous example evaluates strictly from left to right which is sufficient for addition mixed with subtraction. Full-scale arithmetic, however, requires operator precedence and parentheses. The grammar has to be extended:

%token Number '[0-9]+';

expr: prod (add | sub)*;
add: '+' prod;
sub: '-' prod;

prod: factor (mul | div)*;
mul: '*' factor;
div: '/' factor;

factor: prim power*;
power: '**' prim;

prim: num | '(' expr ')';
num: Number;

Commonly, exponentiation is evaluated from right to left and takes precedence over the other binary operations which are evaluated from left to right. Multiplication and division take precedence over addition and subtraction. Before the addition in

2 + 3 * 4

can happen, the rule for prod has to be reduced and the observer created by page.LL1#reflect will execute the corresponding function which can perform the multiplication prior to the addition. Here are the necessary Builder functions for a full-scale interpreter:

actions = {

  expr: function (values) {
    return values.slice(1).reduce(
      function (left, right) { return left + right; }, values[0]); 
  },       

  sub: function (values) { return - values[0]; },

  prod: function (values) {
    return values.slice(1).reduce(
      function (left, right) { return left * right; }, values[0]); 
  },       

  div: function (values) { return 1 / values[0]; },

  factor: function (values) {
    return values.reduceRight(
      function (right, left) { return Math.pow(left, right); }, values.pop());
  },

  num: function (values) { return Number(values[0]); }
};

The functions for expr and prod perform left-to-right evaluation. It is debatable if creating an inverse in the function for div for later multiplication in prod is numerically very sound, but at least JavaScript performs floating point arithmetic and will not crash if a division by zero is attempted. Alternatively, the observer could throw an error message to abort interpretation.

Right associativity is implemented using the JavaScript reduceRight method for Array objects which, however, presents the operands in reverse order for exponentiation with Math.pow().

The trace shows how the computation of

1 + 2*3 - 45/(1 + 2 ** 3)

progresses:

AT  TUPLE            ACTION                          RETURN
10  (2) '+'          reduce num: Number;             num 1
9   (2) '+'          reduce prim: num | '(' expr ')  [ ] 1
7   (2) '+'          reduce factor: prim power*;     num 1
4   (2) '+'          reduce prod: factor (mul | div  num 1
10  (2) '*'          reduce num: Number;             num 2
9   (2) '*'          reduce prim: num | '(' expr ')  [ ] 2
7   (2) '*'          reduce factor: prim power*;     num 2
10  (2) '-'          reduce num: Number;             num 3
9   (2) '-'          reduce prim: num | '(' expr ')  [ ] 3
7   (2) '-'          reduce factor: prim power*;     num 3
5   (2) '-'          reduce mul: '*' factor;         [ ] 3
4   (2) '-'          reduce prod: factor (mul | div  num 6
2   (2) '-'          reduce add: '+' prod;           [ ] 6
10  (2) '/'          reduce num: Number;             num 45
9   (2) '/'          reduce prim: num | '(' expr ')  [ ] 45
7   (2) '/'          reduce factor: prim power*;     num 45
10  (2) '+'          reduce num: Number;             num 1
9   (2) '+'          reduce prim: num | '(' expr ')  [ ] 1
7   (2) '+'          reduce factor: prim power*;     num 1
4   (2) '+'          reduce prod: factor (mul | div  num 1
10  (2) '**'         reduce num: Number;             num 2
9   (2) '**'         reduce prim: num | '(' expr ')  [ ] 2
10  (2) ')'          reduce num: Number;             num 3
9   (2) ')'          reduce prim: num | '(' expr ')  [ ] 3
8   (2) ')'          reduce power: '**' prim;        [ ] 3
7   (2) ')'          reduce factor: prim power*;     num 8
4   (2) ')'          reduce prod: factor (mul | div  num 8
2   (2) ')'          reduce add: '+' prod;           [ ] 8
1   (2) ')'          reduce expr: prod (add | sub)*  num 9
9   (0) $eof         reduce prim: num | '(' expr ')  [ ] 9
7   (0) $eof         reduce factor: prim power*;     num 9
6   (0) $eof         reduce div: '/' factor;         num 0.1111111111111111
4   (0) $eof         reduce prod: factor (mul | div  num 5
3   (0) $eof         reduce sub: '-' prod;           num -5
1   (0) $eof         reduce expr: prod (add | sub)*  num 2
0   (0) $eof         reduce $accept: expr $eof;      [ ] 2
2

The trace shows that rule 4 — the product of 2 and 3 — is reduced long before rule 1 — the overall sum — which has lowest precedence and is reduced just before rule 0 accepts the expression. The effect of the parentheses and precedence of ** over + can be seen by the order in which rule 7 — 2 to the third power, rule 1 — the sum of 1 and 2 ** 3, and rule 4 — the quotient — are reduced.

It should be noted that the same set of Builder functions always works for both, the recursive page.LL1#parser algorithm and the stack-based page.LL1#parse algorithm. The latter is strictly based on a BNF grammar, i.e., value collection works equally well for grammars described using BNF or extended BNF.

Representing arithmetic expressions

An arithmetic expression is often represented as a tree with the operations in the branch nodes and the numbers in the leaves. Lisp's prefix notation is a good way to display such a tree. The following Builder functions create Lisp notation from an arithmetic expression which is a sentence of the grammar shown before.

actions = {

  expr: fromLeft,       
  add: function (values) { return { op: '+', val: values[0] }; },
  sub: function (values) { return { op: '-', val: values[0] }; },

  prod: fromLeft,       
  mul: function (values) { return { op: '*', val: values[0] }; },
  div: function (values) { return { op: '/', val: values[0] }; },

  factor: fromRight,

  num: function (values) { return Number(values[0]); }
};

As before, num converts all matched Number tokens into JavaScript number values.

The semantic actions for the operations add, sub, mul, and div capture the corresponding Lisp operator and its right argument in a collection.

The semantic actions at the two precedence levels of arithmetic, prod and expr share a function fromLeft which creates Lisp notation to combine all operations at the same precedence level, associating from left to right:

function fromLeft (values) {
  return values.slice(1).reduce(
    function (left, right) {
      return '(' + right.op + ' ' + left + ' ' + right.val + ')';
    }, values[0]);
}

fromLeft uses JavaScripts reduce method for Array objects to create one branch node at a time, proceeding from left to right. The first collected value was not processed by one of the operators but all others are collections. Therefore, fromLeft combines the first value and the value from the collection right with the operator from the right.

function fromRight (values) {
  return values.reduceRight(
    function (right, left) {
      return '(expt ' + left + ' ' + right + ')';
    }, values.pop());
}

The semantic action for exponentiation uses a function from Right which is based on JavaScripts reduceRight method for Array objects to create branch nodes proceeding from right to left. from Right is much simpler and the default action — value collection — is sufficient for power because there is only one operation at this precedence level.

Otherwise, if power were encoded like the other operations

  power: function (values) { return { op: 'expt', val: values[0] }; },

In this case fromRight has to treat the leftmost value as a special case and encode the intermediate terms

function fromRight (values) {
  return values.reduceRight(
    function (right, left) {
      return typeof left == 'number' ?
        '(' + right.op + ' ' + left + ' ' + right.val + ')' :
        { op: left.op, val: '(' + right.op + ' ' + left.val + ' ' + right .val + ')' };
    }, values.pop());
}

This would be the general blueprint for right-associativity.

The formula 1 + 2*3 - 45/(1 + 2 ** 3) is represented as Lisp expression

(- (+ 1 (* 2 3)) (/ 45 (+ 1 (expt 2 3))))

Note how multiplication and exponentiation as well as parentheses do take precedence and that addition and subtraction associate to the left. As for right-associativity, (2 ** 3 ** 4 ** 5) is represented as

(expt 2 (expt 3 (expt 4 5)))

Compiling arithmetic expressions

JavaScript has functions as first-order values, i.e., they can be passed as arguments and returned as values. It is tempting to build an arithmetic expression into a nest of functions:

actions = {

  expr: fromLeft,
  add: function (values) { return function (left) { return left() + values[0](); } },
  sub: function (values) { return function (left) { return left() - values[0](); } },

  prod: fromLeft,
  mul: function (values) { return function (left) { return left() * values[0](); } },
  div: function (values) { return function (left) { return left() / values[0](); } },

  factor: fromRight,

  num: function (values) { return function () { return Number(values[0]); }; }
};

num turns a numerical value from a string into a parameter-less function returning the value. An operator such as add expects the collected value to be a parameter-less function returning the right-hand operand value. The semantic actions returns a function which expects a parameter-less function returning the left-hand operand value, evaluates to get both operand values, combines the values and returns the value. combines it with the right-hand value, and returns the result.

function fromLeft (values) {
  return values.slice(1).reduce(
    function (left, right) {
      return function () { return right(left); };
    }, values[0]);
}

fromLeft creates a left-associative tower of parameter-less function calls. Exponentiation is again handled as a special case of right-associativity:

function fromRight (values) {  // special case: Math.pow only
  return values.reduceRight(
    function (right, left) {
      return function () { return Math.pow(left(), right()); };
    }, values.pop());
}

The result of parsing 1 + 2*3 - 45/(1 + 2 ** 3) looks somewhat disappointing:

function () { return right(left); }

On the other hand, the trace shows that the result is an array containing a function which can be unwrapped and executed:

ll1.parser(input, ll1.reflect(null, actions))[0]()

This does return the expected value 2. Note that the result of a reflect observer will be an array containing whatever values were collected for rule 0, i.e., for the start symbol of the grammar. The code shows how to extract the function from the array and call it.

Adding memory

Here is a bit of a twist:

  num: function (_, _tuple, _values) { return function () { return memory[Number(_values[0])]; }; }

This way num turns a Number value into a reference to the array memory which at this point does not even exist. Parsing

7 + 6*5 - 4/(3 + 2 ** 1)

produces the result

function () { return item(total); }

But if memory is set and the result of parsing unwrapped and executed

expr = ll1.parser(input, ll1.reflect(null, actions))[0];

memory = [0, 2, 4, 7, 69, 5, 4, 3]
print( memory.join(' ') )
print( expr() )

delete memory
memory = [0, 3, 2, 1, 45, 3, 2, 1]
print( memory.join(' ') )
print( expr() )

we get

0 2 4 7 69 5 4 3
20
0 3 2 1 45 3 2 1
2

i.e., the result expr is a nest of functions representing the parsed sequence of operations which can be executed on whatever values are stored into memory — even if the name is destroyed and a new binding is created.

A little language

Programming languages usually combine arithmetic expressions with the ability to control the flow of execution. Euclid's algorithm to compute the greatest common divisor of two natural numbers exercises both aspects:

while (x != y)
  if (x > y) x = x - y;
  else y = y - x;

The notation is borrowed from JavaScript and many other languages. This input is a sentence for a grammar which combines the grammar used for most of this chapter with control flow and the use of variable names. The following needs to be inserted between the definitions for Number and expr:

%token Name '[a-z_A-Z][a-z_A-Z0-9]*';

program: statement*;

statement: assign | if | while | block;
assign: Name '=' expression ';';
if: 'if' '(' expression ')' statement else?;
else: 'else' statement;
while: 'while' '(' expression ')' statement;
block: '{' statement* '}';

expression: conj or*;
or: '||' conj;

conj: cmp and*;
and: '&&' cmp;

cmp: expr ((lt | le | gt | ge | eq | ne) expr)?;
lt: '<';
le: '<=';
gt: '>';
ge: '>=';
eq: '==';
ne: '!=';

prim needs to include a variable Name in addition to a Number:

prim: num | name | '(' expression ')';
num: Number;
name: Name;

prim allows a full expression so that all operations can use parentheses.

The semantic actions discussed at the end of the last section implement expr as parameter-less functions. They have to be augmented to handle control flow, conditions, and variable names.

function sequence (values) {
  return values.slice(1).reduce(
    function (left, right) {
      return function () { left(); return right(); };
    }, values[0]);
}

actions = {
  program: sequence,

  assign: function (values) {
    return function () { memory[values[0]] = values[1](); };
  },

The result of most Builder functions are parameter-less functions. sequence uses reduce to build a tower of function calls which calls such functions in order from left to right. One such function could be an assignment: assign expects the string value for a Name, i.e, a variable name, and a parameter-less function producing the value to be assigned, and returns a parameter-less function which will evaluate and assign to some memory location identified by name.

The other statements are similarly mapped to JavaScript:

  'if': function (values) {
    if (values.length == 2)
      return function () { if (values[0]()) values[1](); };
    else
      return function () { if (values[0]()) values[1](); else values[2](); };
  },

  'while': function (values) {
    return function () { while (values[0]()) values[1](); };
  },

  block: sequence,

The arithmetic expressions have to be augmented with comparisons and logical operations but they all follow the familiar pattern:

  expression: fromLeft,
  or: function (values) { return function (left) { return left() || values[0](); } },

  conj: fromLeft,
  and: function (values) { return function (left) { return left() && values[0](); } },

  cmp: function (values) {
    if (values.length == 1) return values[0];
    return function() { return values[1]( values[0], values[2]); };
  },
  lt: function (values) { return function (left, right) { return left() < right(); } },
  le: function (values) { return function (left, right) { return left() <= right(); } },
  gt: function (values) { return function (left, right) { return left() > right(); } },
  ge: function (values) { return function (left, right) { return left() >= right(); } },
  eq: function (values) { return function (left, right) { return left() == right(); } },
  ne: function (values) { return function (left, right) { return left() != right(); } },

JavaScript permits chaining comparison operators; however, a < b < c does not bracket b between a and c. Therefore, the grammar of the little language discussed here does not permit such chaining, at least not without using parentheses. The grammar forces comparison to be non-associative.

prim added name to produce the value associated with a variable name and num, of course, needs to produce the value of a Number token:

  num: function (values) { return function () { return Number(values[0]); }; },
  name: function (values) { return function () { return memory[values[0]]; }; }

memory now is a collection of values, selected by variable names. Here is how Euclid's algorithm would be compiled and executed:

ll1 = new page.LL1(page.EBNF.from( Grammar.value ));      // the grammar
input = ll1.tokenizer()( Input.value );                   // the program
expr = ll1.parser(input, ll1.reflect(null, actions))[0];  // the executable code

memory = { x: 36, y: 54 }           // initialize x and y
expr()                              // run the executable code
print( memory.x + ' ' + memory.y )

This prints

18 18

That's all, folks!

Next: Bootstrap Previous: Top-down parsing