[antlr-interest] Re: Translators Should Use Tree Grammars

micheal_jor open.zone at virgin.net
Sun Nov 21 13:46:41 PST 2004



--- In antlr-interest at yahoogroups.com, "atripp54321" <atripp at c...> wrote:

Hi Andy,

> > Section-specific comments:
> > =========================
> > 
> > 1. The Complex Case: Tranforming a C AST to a Java AST
> > ======================================================

> > To be able to answer queries such as "get all VAR_REF nodes that
> > refer to this declaration" efficiently, you would have added
> > appropriate attributes during the symbol table construction pass(es).
> 
> Right. My problem is that the amount of information I feel I'd
> need in the symbol table is HUGE. Some examples of info I need:
> 
> "Is this variable ever used in an expression where I could change
> it to from an int type to a boolean type?"
> 
> "Is this variable v ever used later in this function in either
> 'v = x;' form or 'if (v)' form?"
> 
> "Is this method named 'main' and have two arguments, one of 'int'
> type and the second of 'char *[]', 'char **', or similar type?"
> 
> "What other variables are assigned a value in which this variable
> is part of the value?"

A symbol table is a data structure (or set of data structures) that
stores and provides fast query of [some] AST node attributes. Some of
those attributes may fully answer, or help in answering, your queries.

Your particular usage - a C-to-Java translator - would require no more
than is typically stored in a symbol table. You are performing static
code analysis, and there are established techniques and literature on
the subject...
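A minimal sketch of that idea in Java - all names here (Binding, Scope, resolve) are invented for illustration, not ANTLR API: the table is a scope chain whose entries carry whatever node attributes the later queries consult.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical scope-chained symbol table. Each Binding carries the
// attributes that post-parse analysis passes fill in and query.
class Binding {
    final String name;
    final String type;   // e.g. "int" or "char **"
    Binding(String name, String type) { this.name = name; this.type = type; }
}

class Scope {
    private final Scope parent;
    private final Map<String, Binding> bindings = new HashMap<>();
    Scope(Scope parent) { this.parent = parent; }

    void declare(Binding b) { bindings.put(b.name, b); }

    // Walk the scope chain, innermost first -- the "fast query" part.
    Binding resolve(String name) {
        Binding b = bindings.get(name);
        return (b != null || parent == null) ? b : parent.resolve(name);
    }

    // The "get all names in scope" style of query: everything visible here.
    List<Binding> getAllBindings() {
        List<Binding> all =
            (parent == null) ? new ArrayList<>() : parent.getAllBindings();
        all.addAll(bindings.values());
        return all;
    }
}
```

Queries like the ones above then become lookups on attributes collected during the symbol-table pass, rather than fresh AST searches.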

> You're probably thinking that these would not go into the symbol
> table, I would just have to write AST-searching code for that.
> My point is that by the time I've got this huge library of
> AST-searching code to do these things, the symbol table is
> superfluous.

You build the symbol table (or at least collect the information it
contains and perform the semantic checks it enables in some other way)
as part of the process of building an AST. If you know all your input
programs will always be correct, you can skip parts of the process -
unless your post-parse analysis needs the information.....

> > 2a. Keeping the TreeStack and SymbolTable in sync with the AST
> > ==============================================================

> No, we're not just talking about renaming local variables,
> we're talking about hundreds of different transformations,
> of which renaming local variables is just one example.

Fine. Provide "hooks" into your processing flow and plug in your
"transformation objects". The rename point above might be one place to
provide a hook for rules that deal with local variables.

> My objection is the way the tree grammar approach slices the
> problem. You have this line above:
>               // process global variable decl
> 
> Well, there may be dozens of "rules" or "transformations" that
> apply to global variables. I don't want one section of
> code where global variable declarations are handled, with
> bits and pieces of each of these "rules" embedded there.

This sounds like a "how do I architect a system that executes sets of
rules against different parts of an AST?" question to me, rather than
a tree-library vs tree-grammar question.

You can have a look at how tools like PMD and CheckStyle are architected.
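One way the "hooks" idea can be sketched - again with invented names (TransformRule, RuleRegistry), not the actual PMD/CheckStyle API: rules register for the node types they care about, and the tree-walker's action for a node type fires every matching rule.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical rule-dispatch sketch: each transformation lives in its
// own rule object; the walker only knows how to fire registered rules.
interface TransformRule {
    void apply(Object node);   // a real system would pass the AST node type
}

class RuleRegistry {
    private final Map<String, List<TransformRule>> rules = new HashMap<>();

    void register(String nodeType, TransformRule rule) {
        rules.computeIfAbsent(nodeType, k -> new ArrayList<>()).add(rule);
    }

    // Called from the grammar action, e.g. at "process global variable decl".
    void fire(String nodeType, Object node) {
        for (TransformRule r : rules.getOrDefault(nodeType, new ArrayList<>())) {
            r.apply(node);
        }
    }
}
```

This keeps each of the "dozens of rules" for global variables in its own class while the grammar action stays a one-line hook.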

> > 2b. Do you want methods in AST or TreeStack?
> > ============================================
> > 
> > decl.Scope.getAllBindings() or similar should give you a list of all
> > names in the scope. This isn't a tree-grammar approach benefit.
> > ANTLR does nothing to help or hinder the requirement that the symbol
> > table code has to be written.
> > 
> > Since the symbol table stores extended node attributes, there isn't
> > a question of which is more "natural". 
> 
> Again, my issue is that by the time you've extended the symbol
> table to include all the information I want, it's become
> a monster (see example questions earlier).

There is no need to "extend" the symbol table. It holds information,
and based on your requirements you decide which analysis functions you
wish to expose (and can expose efficiently). If you need more, you
employ additional formalisms and techniques from static code analysis.

> > The binding attribute on the node
> > is where the actions starts:
> >   IBinding decl  = #varid.Binding;
> >   IList bindings = decl.Scope.getAllBindings();
> 
> Objection, your honor! The defense is mixing ANTLR grammar
> with Java code again! Sorry :) Yo just cringe cada vez
> Veo languages mezclado together. It's very hard to read
> if you don't know both languages.

Do you similarly cringe when you use ANTLR to build lexers and
parsers? Is there some reason why you feel tree-parsers are any
different?

> > 2c. Sometimes you need multiple passes
> > ======================================
> > 
> > Given all the information that the symbol table construction phase
> > accumulates, it would be pretty easy to provide a getAllGlobalVar()
> > method on the symbol table object itself. Combined with the
> > getAllReferringNodes() method on the nodes, what you propose is
> > trivially accomplished. And without expensive (re-)calculation.
> 
> But what about maintenance? You need to keep it up to date.
> I may have 20 different "transformation rules" that can
> change a global variable declaration. That means the symbol
> table has to be notified and be able to handle each of these
> changes, or it has to be smart enough to know when its become
> out of date and regenerate itself.

So you need an updatable symbol table? Build one. Provide delete()
and perhaps rename() methods where they're needed and call them as
appropriate during your transformations. I don't see a problem here,
sorry.

> > 4. Comparing approaches by analyzing ease of change
> > ===================================================
> > 
> > In your third change - we want three blank lines after the "package"
> > line instead of two - I can't understand why there is any confusion
> > in the tree-grammar approach. If you are adding lines *after* the
> > output from the package rule, and the rootID rule calls the package
> > rule, then the change would be in the rootID rule. In your
> > description, it is clear that the rootID rule prints the current two
> > lines in any case.
> 
> With the tree grammar approach, isn't the whole idea that
> each node just prints itself?

Where did you get this idea from? The tree-grammar approach lets you
build tree-walkers by specifying a grammar for the tree that is to be
walked, plus action code to execute as the tree is walked. What your
action code does is up to you.

> So in that case, the treewalker approach isn't buying you anything.

Time. You don't have to write the tree parsers by hand.

> > In your fourth change - we want to change the order of printing...
> > and print the static block before the variable declarations, for
> > example - you can simply employ multiple output writers. Buffer all
> > static blocks in one writer, all variable declarations in another,
> > etc. At the end, print the contents of the various writers in any
> > order you wish. No need to modify the AST structure.
> 
> And isn't that quite a bit more involved than:
>     print(getChild(ast, PACKAGE_DEF));
>     printChildren(ast, "\n",  IMPORT);
>     out.println();
>     out.println();
>     print(getChild(ast, CLASS_DEF));    // only one of these...
>     print(getChild(ast, INTERFACE_DEF));  // will print anything
>     out.println();

These seven lines of code don't matter much. It's the other thousands
of tree-walking lines that the tree-library approach forces you to
write that matter. In the tree-grammar approach, ANTLR generates all
of that.

Incidentally, the tree-grammar snippet is lucid and nearly as concise
(four additional lines compared to the as-is print of the AST):

typeDefinition [PrintWriter output]
{ 
  StringWriter classW  = new StringWriter();      //LINE 1
  StringWriter ifaceW  = new StringWriter();      //LINE 2
}
  :  ( classDeclaration[classW]
     | interfaceDeclaration[ifaceW]
     )*
     {
       // swap to your heart's content
       output.write(classW.toString());          //LINE 3
       output.write(ifaceW.toString());          //LINE 4
     }
  ;
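The same buffering trick can be seen outside any grammar, in plain Java (the class and sample strings below are made up for illustration): each output category gets its own writer during the walk, and the final order is chosen only when the buffers are flushed.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Buffer each output category in its own StringWriter during the tree
// walk; concatenate the buffers in whatever order you want at the end.
public class BufferedEmit {
    public static String emit() {
        StringWriter classW = new StringWriter();
        StringWriter ifaceW = new StringWriter();

        // Suppose the tree walk encountered the class first...
        new PrintWriter(classW, true).println("class C implements I {}");
        new PrintWriter(ifaceW, true).println("interface I {}");

        // ...but we want interfaces printed before classes.
        return ifaceW.toString() + classW.toString();
    }
}
```

Reordering the output is then one line at the end, with no AST surgery.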

> > 5. Limitations of TreeWalkers
> > =============================
> > 
> > I don't see a tree-grammar limitation here Andy. I may be wrong,
> > but isn't the real problem that "a[3]" is an array access while
> > "int[] x" is an array declaration? If I am right they would have
> > different AST representations in any case.
> 
> No, the problem is that a C array declaration can take either form:
> int a[3];
> int[3] a;

And they mean the same thing, right? Isn't the point of an AST to
remove concrete syntax irregularities like this? I can't see why you
need to remember which of the variants the source had originally,
since Java code generation isn't affected by it.
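Whatever the status of the two concrete forms, the normalization point can be sketched like this (all names invented for illustration): both spellings build the identical AST shape, so later passes never see the difference.

```java
import java.util.Objects;

// Hypothetical normalizer: "int a[3];" (postfix declarator) and
// "int[3] a;" (prefix type) both collapse to one ArrayDecl node.
public class ArrayDeclNormalizer {
    static final class ArrayDecl {
        final String elemType; final String name; final int size;
        ArrayDecl(String elemType, String name, int size) {
            this.elemType = elemType; this.name = name; this.size = size;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof ArrayDecl)) return false;
            ArrayDecl d = (ArrayDecl) o;
            return size == d.size && elemType.equals(d.elemType)
                    && name.equals(d.name);
        }
        @Override public int hashCode() {
            return Objects.hash(elemType, name, size);
        }
    }

    // int a[3];  -- size attached to the declarator
    static ArrayDecl fromPostfix(String type, String name, int size) {
        return new ArrayDecl(type, name, size);
    }

    // int[3] a;  -- size attached to the type
    static ArrayDecl fromPrefix(String type, int size, String name) {
        return new ArrayDecl(type, name, size);
    }
}
```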

> > In any case, I can't provide a tree-grammar equivalent of your
> > sample without knowing what the child of the ARRAY_DECLARATOR is
> > when it isn't an EXPR. In other words, without knowing a bit more
> > about your tree's grammar.
> 
> The child of an ARRAY_DECLARATION is always an EXPR.

But "int[] x" has no expression....

> ...which highlights one of my major points: you need to understand
> the AST structure whether you use a treewalker or not. I don't
> even remember what the issue is here, but whatever it is, you're
> about to go off and find an ARRAY_DECLARATION rule in an ANTLR
> tree grammar and figure out how to stuff in some Java code.
> I'd prefer to just be at an "case ARRAY_DECLARATION:" point
> in some Java code, and just say:
> 
> AST expr = ast.getFirstChild();
> doSomethingWithExpr(expr);

My generated tree-grammar does the same ultimately. I just don't have
to write all the code manually.

> > 6. Contrasting the approaches
> > =============================
> > 
> > 1. Code generation isn't magic. We all use it quite happily for
> > lexers and parsers, for instance. The same benefits exist for tree
> > parsers.
> 
> We use ANTLR-like tools for lexers and parsers because the
> code they generate is straightforward and generic. Given a grammar,
> you know exactly what the lexer and parser code should look like.

So you are unfamiliar with tree parsers. Sounds like you could benefit
from learning more about them.

> However, with AST-to-AST transformation, it's not at all clear
> what the functionality needs to be. For example, given a "C source
> AST" to "Java source AST", we would all come up with different
> algorithms to do that transformation.

Just as we would all come up with different structures for the AST we
build in the parser. Or whether to build an AST at all. Or the names
we give to our tokens in the lexer and how many there are. Or whether
to use two lexers and a parser or just one of each, etc....

> We would all end up with
> a set of "rules" like "Find all FUNCTION_DEF nodes with an IDENT
> child with text 'main' and a PARAMS_DEF child that has two children,
> the first of which has a TYPE node with type 'int' ...
> Does a symbol table help us with finding such a node?

Nodes are part of the AST. The symbol table stores node *attributes*.

> No. Which is easier: to extend the symbol table to include
> that information, or write a tree-search library to find it?

Neither. My recommendation: let ANTLR write the tree-search library
for you.

> I think the library approach is easier, especially if we can use
> a preexisting nice, standard tree-search-library out there.

As Loring pointed out, trees for different apps are likely to be very
different indeed. The ANTLR or TreeDL approach of code generation is
likely - proven, actually - to be very much more successful (and
easier to use/reuse) than your generic library approach.

> > 2. It can be argued that the tree-grammar approach actually
> > encourages your "keep related code together" maxim. All the code
> > (excluding library code) that applies to a phase of your
> > transformation is kept together in one file - the grammar file.
> 
> No, you're going to need multiple passes on the tree.

I wrote (note the added emphasis): 

"All the code (excluding library code) that applies to a **phase** of
your transformation is kept together in one file - the grammar file."

> For the "main method" example, I want all my code that handles
> "main method C to Java" in one class. I don't want it
> spread across one grammar that stores method and parameter info
> in a symbol table, a second grammar that makes the change, and
> a third grammar that makes changes to any "main" function calls.

Sure, and you can build a framework that allows you to do that on top
of whatever approach you choose - TreeDL+visitors, ANTLR+tree-parsers
or even ad-hoc as you are doing.

> > 3. Complex tools require training investment before you can use them
> > effectively. Having invested in training to be able to use ANTLR
> > notation for specifying lexers and parsers, tree parsers are the
> > next logical step.
> 
> But I'm able to use the ANTLR lexer and parser without any real
> training. I should be able to *use* ANTLR without really knowing
> much about ANTLR grammars.

Did you require the same of Java? To be able to use it without knowing
much about the language or its libraries?

> I just want ANTLR to lex and parse
> C source and pass me the AST, and I'll take it from there.

Why use an AST at all? Or indeed a generated lexer or parser? With a
file and string processing library, I can do all the stuff that the
lexer/parser/AST enables without ever seeing a tree node. It would be
messy, but it would be all Java and I probably wouldn't even have to
learn ANTLR. ;-)

> > 4. The comments about debugging apply equally to generated tree
> > parsers and hand-written tree walkers (they are often very similar).
> > The tree structure is the same in any case, so print the AST as you
> > please.
> 
> I'm not proposing "hand-written tree walkers" so much as
> a "tree searching and manipulation library". My whole
> point is that AST-to-AST translation is better done as a
> rule-based pattern-matching scheme than a top-down AST-walking
> scheme. Take a look at:
> http://jazillian.com/how.html
> And think about how you'd do all those things with a treewalker.
> i'm certain it would be horrendous.

It's a lot of work but the code structure would be simple and easy to
maintain.

> > 5. ANTLR support for hetero-ASTs could be improved, but nothing you
> > have mentioned so far is beyond ANTLR 2.7.4's hetero-AST support.
> > Can you give an example where this is important?
> 
> I'm not sure what you mean. The ANTLR treewalker approach basically
> says "embed your actions at each node of the grammar". That's
> fine for certain apps, especially compilers, interpreters,
> and code-annotators. But I don't see how it would work for
> a complex source-to-source translator, in which the rules
> for transformation are complex.

That's how these apps are generally written nonetheless.

> > 7. Summary
> > ==========

> The problem is that I don't
> think the tree-grammar approach gets too complicated until
> you've got a pretty large system. I think the first 10% of
> my C-to-Java translator would be fine using a treewalker.
> But soon the symbol table would seem to be completely
> inadequate and I'd end up with all sorts of AST-searching
> code anyway.

I feel I should repeat that you would probably benefit from reviewing
the literature on static code analysis techniques and implementations.
That's what you are doing in an ad-hoc fashion, and it will hurt
eventually as you attempt more complex analysis. IOW, some of the
issues you raise aren't really about tree-library vs tree-grammar IMHO.

Cheers,

Micheal
ANTLR/C#





 