[antlr-interest] Re: Translators Should Use Tree Grammars

Sun Nov 21 08:32:13 PST 2004

> Google doesn't exactly find your JavaEmitter in widespread use, and
> the documentation is too sparse to make it clear why you would want to
> use it as a Java generator.  What does the generated Java code do? 

It just prints out a Java AST as reasonably-formatted Java code.
I know it's not widely used, but as far as I know, 100% of
the people that print a ANTLR-generated Java-AST use it,
because (AFAIK) no treewalker to do the same thing exists.

> As
> to the AST pretty-printing:  it is easy enough to write a
> pretty-printing visitor.  I don't use them, but that is mainly because
> a generated tree grammar does the same thing and is a better basis for
> iterating tree structure with.

I would think you'd certainly want to separate your AST processing
code from your AST printing code, no?

> 
> > > 
> > > First of all, your "tree library" description and examples are
exactly
> > > the sort of thing that one does with a tree walker and cannot be
done
> > > generically by hand (declarations have syntax and are identified by
> > > subtree structure--syntax and semantics vary across languages, so
> > > defining a canonical structure can be difficult).
> > 
> > Could you give an example of something that 
> > you think can't be done "by hand"?
> 
> It is not that it cannot be done by hand--it cannot be done
> generically, a key requirement for a library function.

OK, can you give me an example of some sort of AST processing
that can't be done generically with a library?

> 
> > > "getAllGlobalVars()" is easily done as a tree walker pass; I do that
> > > sort of thing as a matter of routine unless I have to worry about
> > > speed, and I'm hardly alone in that. 
> >  
> > It's easily done by hand, too, without even having to know
> > any grammar syntax.
> 
> Not true; you have to know the form and type content of the subtree
> you are using to identify a variable.  That is syntax.

You can easily find that out with a library function.
Using library functions to figure out the context info you need
seems easier to me than trying to keep a running log of your
context info (i.e. a symbol table plus
that TreeStack thing I was taking about)

> 
> > > The renaming is then just a
> > > second pass.  Your "bottom up" algorithm is messier
(findDeclaration()
> > > will be pretty ugly) and will be slower than the approach of
> > > constructing symbol tables during a tree walk and then checking
> > > VAR_REFS against the local and global symbol tables as they are
> > > encountered.
> > 
> > I agree that it will be slower in general, but suppose you
> > are applying hundreds of changes, any one of which can
> > affect the symbol table. You'd need to either keep re-generating
> > the symbol table (which would be slow), or write lots of
> > code that keeps the symbol table up to date with the numerous
> > AST changes.
> 
> The performance gap then widens considerably--symbol table
> construction during a treewalk is a dynamic process, but all it really
> amounts to is inserting data structures in a hash table (possibly with
> additional support for removing local variables en masse) and either
> replacing the data structure contents as needed or deleting and
> re-inserting a specific element.  Symbol tables are a handy tool, and
> are quite efficient when used correctly.

Then don't you need to have a zillion methods in the symbol table
for maintaining it? For example, I may change a variable name,
type, or modifiers. I may move the variable declaration (say from
the start of a function to the middle, or from within a function
to outside the function, or even to a different file).

> 
> > The general principle is "don't duplicate data";
> > that the symbol table contains a subset of
> > data from the AST...it only exists for efficiency. Accessing
> > the AST directly rather than using a symbol table
> > would (IMO) require less code. Whether you need the
> > added speed of the symbol table depends on the application.
> 
> This is sometimes true, but it is more often true that symbol tables
> incorporate a semantic interpretation of parts of the AST.  Take a
> global variable, for example:  to construct a symbol, you have to
> decode type, name, and other properties from the AST.  The only
> reduction in code from using the AST directly is entry into the symbol
> table and some of the symbol construction (packaging, not analysis). 
> The penalty for not doing that is to have the analysis code repeated
> multiple times.

Again, seems like you need methods for maintaining the symbol table
(or else continually regenerate it).

> > So then what about the specific problems I pointed out
> > under "Comparing approaches by analyzing ease of change"?
> > For example, what about the spacing issue I mention, and 
> > how would you change the ordering of printing of AST children?
> > I'm not saying it can't be done with a treewalker, I'm just
> > saying I think it will require some real thought, whereas
> > with vanilla Java code, these things are trivial.
> 
> Tree grammars are easier to change:  the action additions are
> equivalent to the code additions in your case statements.  The
> specific example you give can be handled by building up strings for
> printing (something that 3.0 will automate), by doing separate walks
> over subtrees from within actions, or by further transformation.

I don't think you're answering the question.
Specifically, I don't think it's easy (or at least not trivial)
to print the children of a node in some order other than the
order that they appear in the AST.

> 
> Solving these types of problem is easy:  the issue with printing is
> almost always a matter of fidding with the details until the output
> "looks good" (esthetics rather than implementation).

But aren't they interrelated? Shouldn't the solution approach
(treewalker vs. non-treewalker) help the developer to "get it
right" the first time? And it should let you spend 95% of 
your "tinkering" time thinking about the problem ("do I
want spaces here") rather than the implementation ("I need
to add a space here, unless the next character is ')', how
do I do that?")

> 
> > > 
> > > In your "Limitations" example, you manually match a subtree and
decide
> > > how to print accordingly.  The equivalent tree snippet is
> something like
> > > 
> > > ad :
> > >     #( ARRAY_DECLARATOR
> > >         ( { out.print("["); } e:EXPR { out.print(e.toString() +
> "]"); }
> > >         | i:ID { out.print(#i.toString() + "[]"); }
> > >         )
> > >     )
> > >     ;
> > 
> > Gack! So for comparison, here's my equivalent non-treewalker code:
> > case ARRAY_DECLARATOR:
> >     if (child1.getType() == EXPR) {
> >         out.print("[");
> >         print(child1);
> >         out.print("]");
> >     }
> >     else {
> >         print(child1);
> >         out.print("[]");
> >     }
> >     break;
> > 
> > 99.9% of Java developers (i.e. the non-ANTLR experts) will find
> > the second one easier to write, to read, and to maintain.
> > Try to see that treewalker code through the eyes of
> > someone who doesn't know ANTLR or is just an ANTLR newbie.
> 
> Well, I could have formatted the tree snippet for readability instead
> of compactness--this is a fairly long post.  And I probably should not
> have included the toString()'s where you omitted them--that does
> reduce clarity.  And were this not a toy example, EXPR would be a
> subtree.  None of that is particularly relevant; it strays from your
> original claim that this is difficult to do with tree grammars when in
> fact it is easy.

Easy for you, not for me.
The cleanups you mention would not change the fact that having
Java code embedded inside ANTLR grammar is very confusing for
those who are not used to it. And I don't see why I should
need to learn (and really become an expert in) ANTLR grammars
to simply search and manipulate a tree data structure that
happened to be generated by ANTLR.

> 
> > > 
> > > Note that you can put actions in reasonable places--they won't
execute
> > > unless matching is successful.
> > > 
> > > In short, I think that you still haven't yet grokked the power of
> trees.
> > 
> > Yes, probably not. I'm sure they're quite powerful, but I'm more
> > concerned with ease-of-development for ANTLR newbies and
> > those who don't know ANTLR at all (and don't really care to learn it).
> 
> Trees and tree grammars are exceedingly powerful and easy to use
> (especially with automated tree generation, coming soon with 2.8). 
> However, they do seem esoteric at first.  That is largely due to
> habituation to procedural languages; once you shift to a syntax-driven
> language paradigm (ANTLR and other BNF and EBNF language recognizers),
> then trees are a natural next step.  Because of the difficulties in
> making this paradigm shift, newbies usually don't use trees unless
> they already have a language translation background.  Once you have
> made that shift, though, trees and tree grammars are an essential part
> of ANTLR usage.
> 
> --Loring

Yea, I agree. It's a lot like a procedural and OO oriented 
programmer trying to learn a functional language (e.g. Prolog).
That's a tough transition to make, and I'd like to avoid it,
especially when I'll already have lots of procedural/OO
code doing processing.

Another big issue that I guess I didn't mention explicitly
but we've hit here is that I think the following functionality
should be clearly and completely separated:
* lexing and parsing of input (using ANTLR)
* AST processing
* AST printing

IMO, there's no reason for the AST processing or AST printing
to have any dependencies on ANTLR. And my proposal is to have
the AST processing should have two parts: 
1) a generic tree-manipulation
library that works with tree data structures in general - nothing
specific to ASTs even. And 2) Application-specific code that
relies on AST structure but not ANTLR.

Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/