[antlr-interest] Re: Translators Should Use Tree Grammars

Sun Nov 21 09:57:46 PST 2004

Hi Michael,
Thanks for the input!
My replies embedded below.

> Hi Andy,
> 
> Read your paper and my overall impression is this:
> 
> Yes, anything that an ANTLR tree parser can do can also be done in
> other ways including the tree-library approach that you describe.
> Nevertheless, I consider the tree-grammar approach to be the "best"
> choice from an engineering point of view. You get the most bang for
> the least amount of work. Alexey's TreeDL approach also has some
> positives from an engineering pov.
> 
> My other impression was that you haven't grokked the tree-grammar
> approach yet. It can take a while as my posts to the list attest
but,
> it's time well spent if you intend to build complex language
> analysis/transformation tools with ANTLR.
> 
> Section-secific comments:
> =========================
> 
> 1. The Complex Case: Tranforming a C AST to a Java AST
> ======================================================
> 
> The tree-library approaches you outline rely on methods that would
> seem to involve significant local (and/or global) tree
> navigation/searching and analysis. Take getAllGlobalVarNames() for
> instance - in the general case it would have to search the entire
> input AST (or AST forest in the multi-file case) to build up this
> list. Similarly for getAllLocalVarNames (localized search) and
> findDeclaration (global search).

The getAllGlobalVarNames() method would only have to search
"global space" in the AST, it wouldn't have to search within
FUNCTION_DEF nodes, for example. Yes, many of these
functions would be inefficient, so whether the approach is reasonable
would depend on the app. I'm essentially doing this slow approach,
but even slower, in Jazillian. I actually search linear
streams of tokens. But then again, it's just a translator,
not a compiler. If it takes a few minutes to translate a
million lines of input, that's ok for me.

> 
> With the typical tree-grammar approach, a symbol table is built
> (either during parsing or as a separate phase) and is then use to
> answer queries such as "is this a global var declaration?" or "get
all
> global var declarations" very efficiently. Some times, the structure
> of the tree encodes such information. For instance, it may be that
> global var declarations are always children COMPILATION_UNIT nodes.
So
> you might have:
> 
> compilationUnit
>  :  #( COMPILATION_UNIT
>        ( .......
>        |  variableDeclaration[VarDeclarationKind.Global]
>        | ......
>        )*
>     )
>  ;
>  
> statementList
>  :  #( STMT_LIST
>        ( .......
>        |  variableDeclaration[VarDeclarationKind.Local]
>        | ......
>        )*
>     )
>  ;
>  
> variableDeclaration [VarDeclarationKind varKind]
>  : #( VARIABLE_DEF type
>       ( ID ( variableInitialization )?
>         {
>            if (varKind == VarDeclarationKind.Global)
>            {
>              // process global variable decl
>            }
>            else
>            {
>              // process non-global variable decl
>            }
>         }
>       )*
>     )
>   ;
> 
> To be able to answer queries such as "get all VAR_REF nodes that
refer
> to this declaration" efficiently, you would have added appropriate
> attributes during the symbol table construction pass(es).

Right. My problem is that the amount of information I feel I'd
need in the symbol table is HUGE. Some examples of info I need:

"Is this variable ever used in an expression where I could change
it to from an int type to a boolean type?"

"Is this variable v ever used later in this function in either
'v = x;' form or 'if (v)' form?"

"Is this method named 'main' and have two arguments, one of 'int'
type and the second of 'char *[]', 'char **', or similar type?"

"What other variables are assigned a value in which this variable
is part of the value?"

You're probably thinking that these would not go into the symbol
table, I would just have to write AST-searching code for that.
My point is that by the time I've got this huge library of
AST-searching code to do these things, the symbol table is
superfluous.

> In fact, it
> is helpful to think of the information in the symbol table as
> attributes of nodes in the AST (which is what they are in fact).
They
> are just designed for very fast lookup.
> 
> 
> 2a. Keeping the TreeStack and SymbolTable in sync with the AST
> ==============================================================
> 
> I don't understand the need for a TreeStack. If what we are talking
> about is renaming a local variable then, following on from the
grammar
> snippets in my previous comments:
> 
> variableDeclaration [VarDeclarationKind varKind]
>  : #( VARIABLE_DEF type
>       ( varid:ID ( variableInitialization )?
>         {
>            if (varKind == VarDeclarationKind.Global)
>            {
>              // process global variable decl
>            }
>            else
>            {
>              // process non-global variable decl
>              IBinding varDecl = #varid.Binding;
>              if (thisVarNeedsRenaming(varDecl,....))
>              {
>              	string....
>              	renameLocalVarTo(varDecl, #varid, 
getNewNameFor(varDecl));
>              }
>            }
>         }
>       )*
>     )
>   ;
> 
> Where renameLocalVarTo() might is a method that looks something
like:
> 
> void renameLocalVarTo(IBinding decl, ISimpleIdentNode node, string
> newName)
> {
>    IList refNodes = node.getAllReferringNodes();
>    foreach(INode refnode in refNodes)
>    {
>       refnode.setText(newName);
>    }
>    node.setText(newName);
>    
>    decl.Scope.rename(decl, newName);  // a) removes decl and it's
key
> from scope dictionary
>                                       // b) re-inserts decl with a
new key
> }

No, we're not just talking about renaming local variables,
we're talking about hundreds of different transformations,
of which renaming local variables is just one example.

My objection is the way the tree grammar approach slices the
problem. You have this line above:
              // process global variable decl

Well, there may be dozens of "rules" or "transformations" that
apply to global variables. I don't want one section of
code where global variable declarations are handled, with
bits and pieces of each of these "rules" embedded there.

I want each "rule" to stand on its own, as a chunk of code.
I want a renameVariables() method that happens to deal
with global variables. I don't want a dozen different operations
at the "Global Variabiable Definition" node of the grammar.
Multiple treewalkers helps that, but then you've got a the
grammar spread across a dozen files. And just storing
"all information about the Global Variable Definition" in
a symbol table just doesn't cut it (see the previous section
for why).

The TreeStack thing was my way of capturing this "all information
about the AST (that's not in the symbol table) 
that we may have picked up along the way."

> 2b. Do you want methods in AST or TreeStack?
> ============================================
> 
> decl.Scope.getAllBindings() or similar should give you a list of all
> names in the scope. This isn't a tree-grammar approach benefit.
ANTLR
> does nothing to help or hinder the requirement that the symbol table
> code has to be written.
> 
> Since the symbol table stores extended node attributes, there isn't
a
> question of which is more "natural". 

Again, my issue is that by the time you've extended the symbol
table to include all the information I want, it's become
a monster (see example questions earlier).

> The binding attribute on the node
> is where the actions starts:
>   IBinding decl  = #varid.Binding;
>   IList bindings = decl.Scope.getAllBindings();

Objection, your honor! The defense is mixing ANTLR grammar
with Java code again! Sorry :) Yo just cringe cada vez
Veo languages mezclado together. It's very hard to read
if you don't know both languages.

(Translation: I just cringe every time I see languages mixed
together)

> 2c. Sometimes you need multiple passes
> ======================================
> 
> Given all the information that the symbol table construction phase
> accumulates, it would be pretty easy to provide a getAllGlobalVar()
> method on the symbol table object itself. Combined with the
> getAllReferringNodes() method on the nodes, what you propose is
> trivially accomplished. And without expensive (re-)calculation.

But what about maintenance? You need to keep it up to date.
I may have 20 different "transformation rules" that can
change a global variable declaration. That means the symbol
table has to be notified and be able to handle each of these
changes, or it has to be smart enough to know when its become
out of date and regenerate itself.

> 4. Comparing approaches by analyzing ease of change
> ===================================================
> 
> In your third change - we want three blank lines after the "package"
> line instead of two - I can't understand why there is any confusion
in
> the tree-grammar approach. If you are adding lines *after* the
output
> from the package rule and, the rootID rule calls the package rule
then
> the change would be in the rootID rule. In your description, it is
> clear that the rootID rule prints the current two lines in any case.

With the tree grammar approach, isn't the whole idea that
each node just prints itself? That a node doesn't print its
children, they print themselves (in the order that they appear
in the AST). So to add a few extra lines (or print children
in a particular order), you have to have your RootID node
explicitly print it's children in a certain order with 
certain stuff printed in between. 

So in that case, the treewalker approach isn't buying you anything.
And if you've got half your nodes automatically "just printing
themselves", and the other half doing this "print my children
in this order and with this stuff in between", then you've
got a mess: two general approaches mixed together.

> 
> In your fourth change - we want to change the order of
printing...and
> print the static block before the variable declarations for example
-
> you can simply employ multiple output writers. Buffer all static
> blocks in one writer, all variable declarations in another etc. At
the
> end print contents of the various writers in any order you wish. No
> need to modify the AST structure.

And isn't that quite a bit more involved than:
    print(getChild(ast, PACKAGE_DEF));
    printChildren(ast, "\n",  IMPORT);
    out.println();
    out.println();
    print(getChild(ast, CLASS_DEF));    // only one of these...
    print(getChild(ast, INTERFACE_DEF));  // will print anything
    out.println();

> 
> 
> 5. Limitations of TreeWalkers
> =============================
> 
> I don't see a tree-grammar limitation here Andy. I may be wrong but,
> isn't the real problem that "a[3]" is an array access while "int[]
x"
> is an array declaration?. If I am right they would have difference
AST
> representations in any case.

No, the problem is that a C array declaration can take either form:
int a[3];
int[3] a;
> 
> In any case, I can't provide a tree-grammar equivalent of your
sample
> without knowing what the child of the ARRAY_DECLARATOR is when it
> isn't an EXPR. In other words without knowing a bit more about your
> tree's grammar.

The child of an ARRAY_DECLARATION is always an EXPR.

...which highlights one of my major points: you need to understand
the AST structure whether you use a treewalker or not. I don't
even remember what the issue is here, but whatever it is, you're
about to go off and find an ARRAY_DECLARATION rule in an ANTLR
tree grammar and figure out how to stuff in some Java code.
I'd prefer to just be at an "case ARRAY_DECLARATION:" point
in some Java code, and just say:

AST expr = ast.getFirstChild();
doSomethingWithExpr(expr);
> 
> 
> 6. Contrasting the approaches
> =============================
> 
> 1. Code generation isn't magic. We all use it quite happily for
lexers
> and parsers for instance. The same benefits exist for tree parsers.

We use ANTLR-like tools for lexers and parsers because the
code they generate is straightforward and generic. Given a grammar,
you know exactly what the lexer and parser code should look like.

However, with AST-to-AST transformation, it's not at all clear
what the functionality needs to be. For example, given a "C source
AST" to "Java source AST", we would all come up with different
algorithms to do that transformation. 

We would all end up with
a set of "rules" like "Find all FUNCTION_DEF nodes with an IDENT
child with text 'main' and a PARAMS_DEF child that has two children,
the first of which has a TYPE node with type 'int' ...
Does a symbol table help us with finding such a node?
No. Which is easier: to extend the symbol table to include
that information, or write a tree-search library to find it?
I think the library approach is easier, especially if we can use
a preexisting nice, standard tree-search-library out there.

> 
> 2. It can be argued that the tree-grammar approach actually
encourages
> your "keep related code together" maxim. All the code (excluding
> library code) that applies to a phase of your transformation is kept
> together in one file - the grammar file.

No, you're going to need multiple passes on the tree.

For the "main method" example, I want all my code that handles
"main method C to Java" in one class. I don't want it
spread across one grammar that stores method and parameter info
in a symbol table, a second grammar that makes the change, and
a third grammar that makes changes to any "main" function calls.

> 
> 3. Complex tools require training investment before you can use then
> effectively. Having invested in training to be able to ANTLR
notation
> for specifying lexers and parsers, tree parsers are the next logical
step.

But I'm able to use the ANTLR lexer and parser without any real
training. I should be able to *use* ANTLR without really knowing
much about ANTLR grammars. I just want ANTLR to lex and parse
C source and pass me the AST, and I'll take it from there.

> 
> 4. The comments about debugging apply equally to generated tree
> parsers and hand-written tree walkers (they are often very similar).
> The tree structure is the same in any case so print the AST as you
please.

I'm not proposing "hand-written tree walkers" so much as
a "tree searching and manipulation library". My whole
point is that AST-to-AST translation is better done as a
rule-based pattern-matching scheme than a top-down AST-walking
scheme. Take a look at:
http://jazillian.com/how.html
And think about how you'd do all those things with a treewalker.
i'm certain it would be horrendous.

> 
> 5. ANTLR support for hetero-ASTs could be improved but nothing you
> have mentioned so far is beyond the support ANTLR 2.7.4's hetero-AST
> support. Can you give an example where this is important?

I'm not sure what you mean. The ANTLR treewalker approach basically
says "embed your actions at each node of the grammar". That's
fine for certain apps, especially compilers, interpreters,
and code-annotators. But I don't see how it would work for
a complex source-to-source translator, in which the rules
for transformation are complex.

I've spent the last couple of years writing a C to Java translator,
yet I have no idea what actions I would fire at any particular
node in a treewalker. That's as it should be. If I asked you
"when translating C to Java, what do you do with a function
declaration", the question makes no sense. On the other hand,
if I ask "when translating C to Java, how to you translate
the main() method?", now I can easily tell you the steps.
That's how the problem should be sliced.

> 
> 
> 7. Summary
> ==========
> 
> All I can say is that the two approaches you present here are tools
> for getting a job done. If you've invested the time to learn both
> approaches and you feel the tree-library approach offers you more
then
> by all means use it. 
> 
> I suspect however that you are yet to fully grok the tree-grammar
> approach. What information or resources would help you to better
> understand the tree-grammar approach?. Perhaps knowing this would
help
> in making ANTLR v3 docs even more useful.
> 
> Cheers!
> 
> Micheal
> ANTLR/C#

The problem is that I don't
think the tree-grammar approach gets too complicated until
you've got a pretty large system. I think the first 10% of
my C-to-Java translator would be fine using a treewalker.
But soon the symbol table would seem to be completely
inadequate and I'd end up with all sorts of AST-searching
code anyway.

Again, thanks for the input. 
Andy

Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/