[antlr-interest] tool for left recursion removal

Eric Mahurin eric_mahurin at yahoo.com
Sun Oct 30 06:58:34 PST 2005


Before I continue too far off topic (and give another plug for
my own LL parser), I'll ask again: anybody know of any software
for doing left-recursion removal?

--- Suman Karumuri <mansuk at gmail.com> wrote:

> Hi Eric,
> 
> Why not create a ruby code generation for antlr?

Here are my initial reason for making my own in ruby:

1. My holy grail was to be able to specify a language grammar
directly in  the target language hopefully with BNF-like
syntax.  When you do this, you avoid these problems:

   A. a grammar language to learn (and for the parser to parse)
   B. funny $/macros in your actions
   C. parser may need to parse a subset of the target language
      to handle the actions - at least a lexer?
   D. limitations on the actions for various reasons

2. Even more symmetry between lexer and parser.  With what I
came up with, they are *identical*.  The difference is what
their input and output is.  lexer - input: char stream, output:
token stream; parser - input: token stream, output: AST stream.
 I created a separate package (cursor) to give something like
java streams and C++ iterators but with more abstraction and
power.  One of the results of the symmetry of lexer/parser is
that you can define tokens however you like (they just needs to
respond to the === method).  Ah, the joys of a dynamically
typed language.

3. Wanted to be able to naturally handle arbitrary lookahead. 
ANTLR can handle it, but not very pretty/efficiently last time
I checked - you replicate the grammar in a syntactic predicate.
 With mine, you simply call "lookahead" on a grammar to to
create a grammar with arbitrary lookahead.  No need to specify
a global lookahead for the language.  The default is 1 and if
you need more you just apply this lookahead method where
needed.

4. Wanted the concept that every grammar rule/subrule is an
object.  Thus the project/class name is Grammar.  A rule is
basically an assignment.

5. Very easily add flexibility/power by adding more methods
that can be applied to grammar objects.

6. Users can easily add their own generic grammar
transformations (simply a lambda that takes grammar objects or
whatever as arguments and generates a new grammar object).

7. No need for a parser generation stage.  Effectively, the
parser is made at run-time.  This will result in a slight
startup penalty, but really helps ease of use.  No need to
figure out how to integrate generated files.

8. Well defined way of handling "ambiguities" such that there
are none similar to regular expressions.  Some may consider
this a disadvantage because there are no "ambiguity" warnings. 
In many of the cases where ANTLR gave me these warnings, it was
mainly a matter a figuring out how to turn them off - setting
greedy/non-greedy, etc.  Like regular expression alternations,
mine are also prioritized - first match wins.  This removes the
ambiguities and makes it easier to specify some things.

Later, here are the other advantages I added/found:

9. Instead of a lexer just parsing one token at a time and
possibly retaining state, I define a lexer as a grammar object
that parses the entire stream of characters and generates a
stream of tokens.  For simple lexers, this would just be a
repetition of simple token grammars, but for more complex ones,
each of these tokens might actually be a series of tokens
instead.  Doing it this way makes it so most lexer state is not
needed.  The only lexer state that may be absolutely needed
would be state controlled by the parser.  Doing a lexer this
way also forces it to be in a separate thread from the parser
(unless you want to use continuations).  So we may get a
performance boost from it being multi-threaded.  I could also
make a conventional-style lexer where you only specify one
token, but I haven't made the Cursor class to do it - don't see
the advantage of doing it either.

10. Code flattening.  Instead of each rule generating its own
method and having the run-time/stack penalty of a method call,
I flattened the code so that only when you have recursion do
you need to go to another method.  Intially, I wasn't doing any
code generation and each grammar transformation (sequence,
alternation) was another level on the stack.  For the sake of
performance, I'm now doing code generation with flattening with
"eval".

11. Tail recursion (only in my development version).  Before I
had specific methods for handling various types of grammar
loops.  Now all of those are aliases for tail recursion.  And
you can specify any tail recursion and it will become a loop
instead of actual recursion.  But, you do have to be explicit
about when you want this tail recursion optimization - I'm not
detecting it automatically.



> However, developing a backend for antlr  is still worth it
> because, it
> gives you the flexibility to move across languages. For
> example i am
> generating python code for the prototype, but for the
> production code
> i would be generating c/c++ for the same grammar which is a
> nice thing
> to have.

I've been thinking about making various classes with the same
API (or extend the API) to do things like that:

- generate ruby code to a file so like most other parser
generators

- generate code for other languages - actions would be
specified in strings instead of a simple ruby block

- generate code for C/C++ and Inline it in ruby to make the
parser faster.  This could be done transparently with current
grammars with ruby actions (C code would call the ruby
methods).

- generate syntax diagrams of the grammar

- port this library to other OO languages.  The ideal
candidates would be those with operator overloading,
closures/anonymous functions, and dynamic typing, but I think
these could be overcome with some ugliness in the grammar
specification.

Right now, probably the most portable thing to use would be a
variation of yacc since a variant of it exists in almost every
language.

> -Suman
> On 10/30/05, Eric Mahurin <eric_mahurin at yahoo.com> wrote:
> > Hey all, even though I haven't used ANTLR in about a year
> or
> > so, I still take inspiration from it.  I ended up designing
> my
> > own LL(1/infinite) parser in Ruby and I started with ideas
> from
> > ANTLR.  Unlike other parsers, you specify the grammar
> directly
> > in the target language so that integration with logic
> outside
> > the grammar is easy.  The same techniques could be applied
> to
> > any other OO language, but it would be better if the
> language
> > had operator overloading and closures (Java has neither).
> > Here's a link if anyone is interested:
> >
> > http://rubyforge.org/projects/cursor/
> >
> > Anyways, back to my real question.  Like other LL parsers,
> the
> > big problem is dealing with grammars that have
> left-recursion.
> > I found many links on the techniques to remove
> left-recursion.
> > But, what I would really like is some tool that took a
> grammar
> > (.y file maybe), removed the left-recursion, and spit out a
> new
> > grammar (also a .y file).  Maybe also did left factoring.
> > Anybody know of something that does this?



		
__________________________________ 
Yahoo! FareChase: Search multiple travel sites in one click.
http://farechase.yahoo.com


More information about the antlr-interest mailing list