Converting LR to LL


Contents:

A guide to converting YACC-based grammars

When using PCCTS or Antlr to implement existing grammars, one usually has to deal with existing YACC grammars. Unfortunately, converting from LR, or more accurately LALR, to LL is not the most straightforward task. Recently, I've converted GNU's YACC based Smalltalk-80 grammar to PCCTS and have compiled a number of notes during the conversion. Some of these notes are not new, and have already been documented in Parr's book or Moog's notes.

The Basics

After several hours research, little could be found for a formal definition of BNF and EBNF. What is written here is what seems to be a general consensus across the Internet and through my compiler and formal language threory books.

BNF grammars uses the following constructs: '*' (kleen star), '|' (or), and parenthesis for grouping. EBNF extends this by adding '+' and braces for optional branches. Note that EBNF does not do anything new, every EBNF construct can be translated to BNF. The following is a grammar for EBNF in EBNF (using PCCTS/Antlr syntax for consistency):

  syntax      : ( rule )*;
  rule        : nonterminal ":" expression ";";
  expression  : term ( "|" term )*;
  term        : ( factor )*; // either zero or more, or lambda as factor
  factor      : identifier
              | string
              | "(" expression ")" { ( "*" | "+" ) }
              | "{" expression "}"
              ;
  identifier  :  letter { letter | digit };
  string      : """ ( any_character )* """;

The following shows some possible translations:

  BNF      EBNF    Description
  (|a)     {a}     nil or one 'a' versus an optional 'a'.
  a(a)*    (a)+    'a' followed by zero or more 'a's versus one or more 'a's
  (|(a)+)  (a)*    nil or one or more 'a's versus zero or more a's
  or:      {(a)+}  an optional one or more 'a's

It should be noted BNF to EBNF translations are one to one, meaning they can be performed both ways and still be valid.

Removing Left Recursion

Knowing how to translate BNF to EBNF (and back) and a few substitutions is the first step to getting LR to LL. First, left recursion is a no-no in LL parsing and is very common in LR grammars. YACC constructs like the following must be rewritten.

  variable_names
      : variable_name
      | variable_names variable_name

the left hand variable_names is a left recursive construct. With EBNF it is very easy to rewrite that rule as:

  variable_names
      : (variable_name)+

likewise:

  arguments
      : arguments argument
      | ;

should be:

  arguments: (argument)*;

These are often seen in YACC grammars for lists of things. Be sure to remove all left recursive productions and BNF stupidities before moving on.

Translating Common LR Patterns

While translating a YACC based Smalltalk-80 grammar, a number of patterns started to occur. The first of which is that many of the LR productions were made solely by the fact that YACC utilizes BNF and therefore can't express as much as EBNF. Cleaning many of these rules up and using PCCTS/Antlr's cross reference generators showed many of these rules being used in only one or two places, and made more sense if substituted. This reduced our rules by about 10 rules and vastly increased the conceptual integrity of the grammar.

Circular Recursion:

Productions of the abstract form:

  a: A b | A;
  b: B a | B;

come in many forms and flavors, all of which can produce ambiguities in LL grammars. The hardest part of correcting these are finding them, since they can theoretically span any number of rules before a cycle is produced.

Once you find this cycle, the easiest way (although sometimes a very bad solution) is to substitute a rule production in order to remove the cycle. The previous example now becomes:

  a: A (B a | B) | A;
  b: B a | B;

Note that 'a' can be rewritten as:

  a: A { B {a} };

Also, if 'b' was only used by 'a' initially, then it is now useless.

I have found that PCCTS/Antlr will not exactly like productions that start with a (...)* subproduction. It is often better to use right recursive productions instead. This occured in a number of rules in ST-80.

Left-Factoring:

Productions of the form:

  a: A b | A c;

are better formulated as:

  a: A ( b | c );

Also note that this also applies to productions spanning multiple rules:

  a: b | c;
  b: A d;
  c: A e;

is better stated as:

  a: A ( b | c );
  b: d;
  c: e;

This is also known as "hoisting" left factors.

All of the previous notes are fairly trivial and can be algorithmically applied to any YACC grammar. At this point, PCCTS/Antlr will probably work with your grammar. However, you are doing it wrong!!!

Reading LR as LL or "You are doing it wrong"

See, the problem is this: We (I) have been looking at the grammar the wrong way. YES, all transformations are perfectly valid, as most of them are really just BNF to EBNF optimizations or the normal/necessary removal of left recursions or cyclic recursions. But it isn't enough. We have been looking at and reading an LR grammar LL. In otherwords, while the YACC generated parser deals with matching the rules from right to left, we have been translating from left to right. This is a simple and natural mistake to make if you only write LL grammars and are new to this.

CASE STUDY

The following is a subset of ST-80 in lobotimized form:

  me : ue | be | ke;
  ue : uod US;
  be : bod BS uod;
  ke : bod (K bod)+;
  uod: P | ue;
  bod: uod | be;
Step 1: Clean cycles via substitution (we choose 'uod' and 'bod' as targets)

Substitute rules for the indirect cycles, and then clean them up:

  uod: P | ue   becomes: P | uod US         becomes: P (US)*
  bod: uod | be becomes: uod | bod (BS uod) becomes: uod (BS uod)*

So now we have:

  me : ue | be | ke;
  ue : uod US;
  be : bod BS uod;
  ke : bod (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;
Step 2: Clean grammar by substituting uod and bod. Left factor (and hoist) 'p'

ue, be, ke all boil down to uod, so we know that we start w/ p. Hoisting it and substituting around should give us a much cleaner grammar:

  me : P (ue | be | ke);
  ue : (US)* US                 becomes: (US)+;
  be : (US)* (BS uod)* (BS uod) becomes: (US)* (BS uod)+;
  ke : (US)* (BS uod)* (K bod)+
  uod: P | ue;
  bod: uod | be;

or:

  me : P (ue | be | ke);
  ue : (US)+;
  be : (US)* (BS uod)+;
  ke : (US)* (BS uod)* (K bod)+
  uod: P | ue;
  bod: uod | be;
Step 3: YOU ARE DOING IT WRONG!!!

Looking at the results, one could suppose that further refinement can be done to remove other left factors, but in reality the '+' vs. '*' productions are a difficult and fruitless factoriing job. The real problem lies is that ue, be, and ke all start with us. How is an LL system supposed to know what route to go down? It doesn't!

An LR grammar has no problems with our previous example:

  me : P (ue | be | ke);
  ue : (US)+;
  be : (US)* (BS uod)+;
  ke : (US)* (BS uod)* (K bod)+;
  uod: P | ue;
  bod: uod | be;

because it is coming from the opposite direction as LL. Now that we realize this we can change our perspective on the problem.

By applying the additional (reverse) transformations at the beginning of the text, we can rewrite 'be' as:

  be: { (US)+ } (BS uod)+

or better, realize the equivalence and substitute:

  be: {  ue   } (BS uod)+

We can do the same thing to ke and we get:

  me : P (ue | be | ke);
  ue : (US)+;
  be : {ue} (BS uod)+;
  ke : {ue} (BS uod)* (K bod)+;
  uod: P | ue;
  bod: uod | be;

Not much of a change, but it sets us up for the final transformation.

Step 4: The Final Translation

Note that these are still transformations to the LR grammar. To switch from LR to LL , we simply shift the head-end optional productions to it's tail-end inverse production:

  ue: (US)+ {be};
  be: (BS uod)+;

To generalize the rule:

  lr1: T1;
  lr2: {lr1} T2;

becomes:

  ll1: T1 {ll2};
  ll2: T2;

This can be proven equivalent by converting the productions from their regular expression form to their equivalent DFA & showing the two DFA's to be the same.

By applying these changes to the whole case study we go from:

  me : P (ue | be | ke);
  ue : (US)+;
  be : {ue} (BS uod)+;
  ke : {ue} (BS uod)* (K bod)+;
  uod: P | ue;
  bod: uod | be;

to:

  me : P (ue | be | ke);
  ue : (US)+     {be | ke};
  be : (BS uod)+ {ke};
  ke : (K bod)+;
  uod: P | ue;
  bod: uod | be;

One More Time

Step 1: The original

  me : ue | be | ke;
  ue : uod US;
  be : bod BS uod;
  ke : bod (K bod)+;
  uod: P | ue;
  bod: uod | be;

Step 2: Remove all cyclic rules via rule substitution

  me : ue | be | ke;
  ue : uod US;
  be : bod BS uod;
  ke : bod (K bod)+;
  uod: P | uod US;
  bod: uod | bod BS uod;

Step 3: Rewrite all left-recursive productions

  me : ue | be | ke;
  ue : uod US;
  be : bod BS uod;
  ke : bod (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 4: Substitute bod in be and ke. Convert be from * to +

  me : ue | be | ke;
  ue : uod US;
  be : uod (BS uod)+;
  ke : uod (BS uod)* (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 5: Substitute uod in ue, be, and ke. Convert ue from * to +

  me : ue | be | ke;
  ue : P (US)+;
  be : P (US)* (BS uod)+;
  ke : P (US)* (BS uod)* (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 6: Hoist the P from ue, be, and ke to me

  me : P (ue | be | ke)
  ue : (US)+;
  be : (US)* (BS uod)+;
  ke : (US)* (BS uod)* (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 7: Rewrite ke and be in terms of ue

  me : P (ue | be | ke)
  ue : (US)+;
  be : {(US)+} (BS uod)+;
  ke : {(US)+} (BS uod)* (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 8: Substitute ue in be and ke

  me : P (ue | be | ke)
  ue : (US)+;
  be : {ue} (BS uod)+;
  ke : {ue} (BS uod)* (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 9: Hoist and reverse {ue} from be and ke to ue

  me : P (ue | be | ke)
  ue : (US)+ {be|ke};
  be : (BS uod)+;
  ke : (BS uod)* (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 10: Rewrite ke in terms of be

  me : P (ue | be | ke)
  ue : (US)+ {be|ke};
  be : (BS uod)+;
  ke : {(BS uod)+} (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 11: Substitute be in ke

  me : P (ue | be | ke)
  ue : (US)+ {be|ke};
  be : (BS uod)+;
  ke : {be} (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

Step 12: Hoist and reverse be in ke

  me : P (ue | be | ke)
  ue : (US)+ {be|ke};
  be : (BS uod)+ {ke};
  ke : (K bod)+;
  uod: P (US)*;
  bod: uod (BS uod)*;

At this point, you've got unique terminals on every production start.

Summary

For all head-end options (by "options" I mean productions of the form '(.)*' or '{}' since we don't have to do them) we rewrite them as their BNF equivalent. This is so that, in this example at least, we can see the patterns and match them to their counterparts. Once we match them to the previous rule, we use substitution to actually state the production as an optional rule. Now, for every option on the head-end, we move it as the inverse option on the tail end of the rule it previousely referenced and have it reference whee it came from. This is the LL way of parsing rules and we now have a correct LL parser.