[antlr-interest] Re: Summary of ANTLR Issues

Mon Jul 7 15:24:15 PDT 2003

Just to add my two cents ...

--Loring

--- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...> 
wrote:
> 
> On Monday, July 7, 2003, at 10:22  AM, Tiller, Michael (M.M.) wrote:
> > Synthetic tokens:
> >  
> > I find myself using "synthetic" tokens quite often.  However, I'm 
not 
> > sure if this is because I think in some "strange" way about the 
AST 
> > I'm building or if perhaps that the language I'm parsing just 
doesn't 
> > suit itself to utilizing the existing tokens?
> >  
> > I like to see an AST that is populated with meaningful structural 
> > entities.  That means nodes for "declarations", "definitions", 
> > "inheritance", etc.  All the examples I see in ANTLR use existing 
> > tokens as the roots for these trees.  Now this works reasonably 
well 
> > for expression parsing where the operators exist as tokens already 
and 
> > they really are fundamentally the nodes in the system (i.e. [+ [* 
2 3] 
> > 7]).  But it doesn't seem to make sense for things like 
declarations.  
> > I mean sure you might have a token (e.g. a keyword) in your 
particular 
> > language (e.g. "declare" in something like "declare Int x;") that 
> > would work for this, but I didn't have something like that most of 
the 
> > time.
> 
> Yep...we call these imaginary tokens...the Java grammar uses them 
all 
> over.
> 
> > Another interesting observation.  Many books use the word 
"terminals" 
> > when discussing the tokens that the lexer generates.  I find it 
> > somewhat ironic therefore that without synthetic nodes, these 
> > "terminals" end up being the nodes in the AST rather than the 
leaves.

The textbooks are all biased towards parse trees.  The generated trees 
incorporate the structural semantics of the grammar that you used to 
recognize elements in the parse tree.  ANTLR uses concrete syntax 
trees (AST is a misnomer) instead of parse trees--this gives more 
flexibility for tree transformation.

If you are mildly clever, you can usually use existing tokens to 
structure a syntax tree.  For "C", for example, ";" can be used as the 
root for a simple statement, ")" for a function, "=" for an 
assignment, <id> for a declaration, and so forth.

That said, it would not be that difficult to have ANTLR generate parse 
trees--it just takes adding a root to the AST returned by a rule.  If 
there is sentiment for supporting parse trees as well as ASTs (may 
have some value for interfacing to other tools), that would probably 
make it into ANTLR 3.

> > So, you are probably saying...ANTLR supports imaginary tokens so 
what 
> > are you on about here.  And that is certainly true.  But I get the 
> > sense that synthetic tokens are considered "second class" because 
> > there doesn't seem to be direct support for them.  By "direct 
> > support", I mean the ability to use them for automatic AST 
> > construction.  For example, why not allow the following:
> >  
> > declaration<AST=DeclarationNode>
> >   : type name ";"
> >   ;
> 
> In this case because we don't like parse trees, we like ASTs. :)
> 
> >  
> > The idea here would be to allow something like the "^" operator 
for 
> > the rule itself and be able to specify a synthetic token.  This 
could 
> > be equivalent to:
> >  
> > declaration
> >   : type name ";"
> >    { ## = #[DECLARATION, "declaration", ##]; }
> >   ;
> >  
> > Which I find myself doing all through my code. 
> 
> Me too.  yes, better support for this could be had.

I've actually implemented equivalent functionality in my tree rewrite 
stuff and that is something that you can expect to see in ANTLR 3.  
Licensing issues (I can release it under Open Source license, but not 
as PD; Ter is keeping ANTLR 2 PD just to avoid opening a can of worms) 
have kept me from releasing this stuff before now.

> >  This not only has benefits in terms of automatic tree 
construction, 
> > but it would be very handy for issues related to heterogeneous 
> > ASTs...which leads me to...
> >  
> > Heterogeneous ASTs:
> >  
> > For my project, I used heterogeneous ASTs.  What can I say, I 
guess I 
> > was just born hetero!  :-).  But I feel a little out of place in a 
> > largely homogenous AST world (I'm sure there are other heteros out 
> > there, but we definitely seem to be in the minority).
> >  

Actually, the AST support in ANTLR 2 is a bit primitive, and there are 
other things broken in the canonical 2.7.2 distribution--the 
"guessing" tests were removed from AST construction statements so that 
 strange behavior will result if you have a grammar with syntactic 
predicates that constructs an AST.  ANTLR 2 has a glaring lack of test 
scaffolding:  this problem should have been caught during testing.  
ANTLR 2 works well if you don't demand too much of the AST support 
structure.

> > Again, you are probably saying "but ANTLR doesn't show any bias 
for 
> > AST orientation, what are you complaining about?".  Well, it is 
true 
> > that ANTLR supports heterogeneous ASTs for the most part.  It 
seems 
> > that it probably works better in Java than in C++.  I had several 
> > problems with trying to use heterogeneous ASTs in C++.  Perhaps 
some 
> > of this is my own stupidity, but what I found was:
> >  
> > 1) There is a major bug in 2.7.2 that prevents you from cloning 
hetero 
> > ASTs in C++ (another indication that most people use homogenous 
ASTs).
> > 2) Even though I can associate heterogeneous types with tokens, 
ANTLR 
> > doesn't respect them for synthetic tokens.  By "respect" I mean 
that 
> > it doesn't generate the appropriate factory initialization code 
(there 
> > is a workaround for this by creating a dummy rule that utilizes 
the 
> > synthetic tokens as terminals) and it doesn't allow you to operate 
on 
> > specific members and methods for your heterogeneous ASTs in the 
> > production rules (because you have to manually create them so it 
has 
> > to use the factory and therefore uses a generic interface).  The 
> > former is what I was eluding to at the end of the previous section 
on 
> > synthetic nodes.
> > 3) Building your own custom AST types is a tedious task.  Ideally, 
it 
> > would be quite slick if you could formalize some kind of "mixin" 
idiom 
> > in the grammar itself so that you didn't have to worry about the 
> > explicit details of how to inherit from this or define a template 
> > method for that.  Currently, you have to really get to know the 
> > C++ AST classes and the class hierarchy is (in my opinion) pretty 
> > awkward, which leads me to....

That is on the table for ANTLR 3--I think that there is no reason not 
to support automatic generation of AST classes and to support get/set 
attributes as part of the ANTLR language syntax.

--Loring

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/