[antlr-interest] Re: Translators Should Use Tree Grammars

atripp54321 atripp at comcast.net
Mon Nov 22 13:16:12 PST 2004



--- In antlr-interest at yahoogroups.com, "micheal_jor" <open.zone at v...>
wrote:
> 
> --- In antlr-interest at yahoogroups.com, "atripp54321" <atripp at c...>
> wrote:
> 
> Hi Andy,
> 
> > > Section-specific comments:
> > > =========================
> > > 
> > > 1. The Complex Case: Transforming a C AST to a Java AST
> > > ======================================================
> 
> > > To be able to answer queries such as "get all VAR_REF nodes that
> > > refer to this declaration" efficiently, you would have added
> > > appropriate attributes during the symbol table construction pass(es).
> > 
> > Right. My problem is that the amount of information I feel I'd
> > need in the symbol table is HUGE. Some examples of info I need:
> > 
> > "Is this variable ever used in an expression where I could change
> > it from an int type to a boolean type?"
> > 
> > "Is this variable v ever used later in this function in either
> > 'v = x;' form or 'if (v)' form?"
> > 
> > "Is this method named 'main' and have two arguments, one of 'int'
> > type and the second of 'char *[]', 'char **', or similar type?"
> > 
> > "What other variables are assigned a value in which this variable
> > is part of the value?"
> 
> A symbol table is a data structure (or set of data structures) that
> stores and provides fast query of [some] AST node attributes.
> Some of those attributes may either fully answer or help in
> answering your queries.

I know what a symbol table is, thank you.
I've just made the point that I don't think that a symbol
table typically contains the sort of information that I
just gave four examples of.

> 
> Your particular usage - a C-to-Java translator - would require no more
> than is typically stored in a symbol table. 

So you believe that a symbol table typically contains answers
to the kinds of questions I list above? If so, can you
give me a reference to such an implementation, or mention
of one in the literature? Maybe I just haven't kept up to
date, but I've always read that a symbol table contains
little more than variable and method declaration and
scope information.
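
Just to be concrete about what I mean by "little more than declaration
and scope information", here's a rough sketch of the kind of entry I'd
expect a typical symbol table to hold. The names (SymbolEntry, Scope)
are made up for illustration; this isn't from any particular compiler
or from ANTLR:

import java.util.*;
import antlr.collections.AST;

// One entry per declaration: little more than name, type, and scope.
class SymbolEntry {
    String name;          // identifier as declared
    String type;          // declared type, e.g. "int" or "char **"
    Scope scope;          // enclosing scope (file, function, block)
    AST declarationNode;  // back-pointer to the declaring AST node
}

// Scopes nest; lookup walks outward until a binding is found.
class Scope {
    Scope parent;                  // null at file level
    Map bindings = new HashMap();  // String name -> SymbolEntry

    SymbolEntry lookup(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            SymbolEntry e = (SymbolEntry) s.bindings.get(name);
            if (e != null) return e;
        }
        return null;
    }
}

Nothing in there answers questions like "is this variable ever used
later in an 'if (v)' form" - that's the gap I'm talking about.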

> You are performing static
> code analysis and there are established techniques and literature on
> the subject...

Yes, and I'm saying that I've tried the established techniques
and found them severely lacking.

> 
> > You're probably thinking that these would not go into the symbol
> > table, I would just have to write AST-searching code for that.
> > My point is that by the time I've got this huge library of
> > AST-searching code to do these things, the symbol table is
> > superfluous.
> 
> You build the symbol table (or at least collect the information it
> contains and perform the semantic checks it enables in some other
> way) as part of the process of building an AST. If you know all your
> input programs will always be correct, you can skip parts of the
> process. Unless your post-parse analysis needs the information.....

I don't think you're addressing what I said: I'm saying that
if I make my symbol table so complicated that it contains the
information that I need (e.g. answers those questions above
plus dozens of others), then the symbol table has so
much functionality that it's basically a huge library
in its own right.

> 
> > > 2a. Keeping the TreeStack and SymbolTable in sync with the AST
> > > ==============================================================
> 
> > No, we're not just talking about renaming local variables,
> > we're talking about hundreds of different transformations,
> > of which renaming local variables is just one example.
> 
> Fine. Provide "hooks" into your processing flow and plug in your
> "transformation objects". The rename point above might be one place
to
> provide a hook for rules that deal with local variables.

OK, and my app would best be served by having each node's
action be the same: check to see if this node in the AST
matches any of my "transformation rules", and do the
corresponding transformation. And of course, at that point,
I'm not really using the power of the treewalker approach.
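
Roughly, I mean something like this - TransformationRule and RuleEngine
are hypothetical names, just to show the shape of it:

import antlr.collections.AST;
import java.util.*;

// Hypothetical rule interface: each C-to-Java "transformation rule"
// knows whether it applies to a node and how to rewrite it.
interface TransformationRule {
    boolean matches(AST node);
    void apply(AST node);
}

// The one action every node gets: try every rule against the node.
class RuleEngine {
    private List rules = new ArrayList();  // of TransformationRule

    void addRule(TransformationRule r) { rules.add(r); }

    void process(AST node) {
        for (Iterator i = rules.iterator(); i.hasNext(); ) {
            TransformationRule r = (TransformationRule) i.next();
            if (r.matches(node)) {
                r.apply(node);
            }
        }
    }
}

If every tree-grammar action just calls process() on the current node,
the grammar is only doing the walking.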

> 
> > My objection is the way the tree grammar approach slices the
> > problem. You have this line above:
> >               // process global variable decl
> > 
> > Well, there may be dozens of "rules" or "transformations" that
> > apply to global variables. I don't want one section of
> > code where global variable declarations are handled, with
> > bits and pieces of each of these "rules" embedded there.
> 
> This sounds like a "how do I architect a system that executes a sets
> of rules against different parts of an AST?" question to me rather
> than a tree-library vs tree-grammar approach question.

Yes, that's the question exactly. And the answer is either
"use a tree-library" or "use a tree-grammar".
> 
> You can have a look at how tools like PMD and CheckStyle are architected.

OK, I'll check those out, thanks.

> 
> > > 2b. Do you want methods in AST or TreeStack?
> > > ============================================
> > > 
> > > decl.Scope.getAllBindings() or similar should give you a list of all
> > > names in the scope. This isn't a tree-grammar approach benefit. ANTLR
> > > does nothing to help or hinder the requirement that the symbol table
> > > code has to be written.
> > > 
> > > Since the symbol table stores extended node attributes, there isn't
> > > a question of which is more "natural". 
> > 
> > Again, my issue is that by the time you've extended the symbol
> > table to include all the information I want, it's become
> > a monster (see example questions earlier).
> 
> There is no need to "extend" the symbol table. It has information in
> it and you decide based on your requirements what analysis functions
> you wish to expose (and can expose efficiently). If you need more,
you
> employ additional formalisms and techniques in static code analysis.
> 
> > > The binding attribute on the node
> > > is where the action starts:
> > >   IBinding decl  = #varid.Binding;
> > >   IList bindings = decl.Scope.getAllBindings();
> > 
> > Objection, your honor! The defense is mixing ANTLR grammar
> > with Java code again! Sorry :) I just cringe every time
> > I see languages mixed together. It's very hard to read
> > if you don't know both languages.
> 
> Do you similarly cringe when you use ANTLR to build lexers and
> parsers? Is there some reason why you feel tree-parsers are any
> different?

No, because there's no mixing of code there, it's all ANTLR.
And besides, I can (pretty much) use ANTLR without understanding
the grammar syntax - just use the .g files that came with it.

> 
> > > 2c. Sometimes you need multiple passes
> > > ======================================
> > > 
> > > Given all the information that the symbol table construction phase
> > > accumulates, it would be pretty easy to provide a getAllGlobalVar()
> > > method on the symbol table object itself. Combined with the
> > > getAllReferringNodes() method on the nodes, what you propose is
> > > trivially accomplished. And without expensive (re-)calculation.
> > 
> > But what about maintenance? You need to keep it up to date.
> > I may have 20 different "transformation rules" that can
> > change a global variable declaration. That means the symbol
> > table has to be notified and be able to handle each of these
> > changes, or it has to be smart enough to know when it's become
> > out of date and regenerate itself.
> 
> So you need an updatable symbol table? Build one. Provide delete()
> and perhaps rename() methods where it's needed and call them as
> appropriate during your transformations. Don't see a problem here, sorry.

By the time the symbol table contains all the functionality
I need, it's no longer a "symbol table" (at least not like
one I've ever heard of), it's a "tree library".
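
For the record, I do understand the kind of thing you're suggesting -
something roughly like this sketch (the names and the shape of it are
mine, not from any library):

import java.util.*;

// Rough sketch of an "updatable" symbol table with explicit update
// operations that every mutating transformation has to remember to call.
class UpdatableSymbolTable {
    private Map bindings = new HashMap();  // String name -> entry object

    void define(String name, Object entry) { bindings.put(name, entry); }
    Object lookup(String name)             { return bindings.get(name); }

    // A transformation that renames a declaration must call this...
    void rename(String oldName, String newName) {
        Object entry = bindings.remove(oldName);
        if (entry != null) bindings.put(newName, entry);
    }

    // ...and one that removes a declaration must call this.
    void delete(String name) { bindings.remove(name); }
}

With 20 different transformations mutating the tree, keeping calls like
these consistent is exactly the maintenance burden I was describing.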

> 
> > > 4. Comparing approaches by analyzing ease of change
> > > ===================================================
> > > 
> > > In your third change - we want three blank lines after the "package"
> > > line instead of two - I can't understand why there is any confusion
> > > in the tree-grammar approach. If you are adding lines *after* the
> > > output from the package rule, and the rootID rule calls the package
> > > rule, then the change would be in the rootID rule. In your
> > > description, it is clear that the rootID rule prints the current
> > > two lines in any case.
> > 
> > With the tree grammar approach, isn't the whole idea that
> > each node just prints itself?
> 
> Where did you get this idea from? The tree-grammar approach lets you
> build tree-walkers by specifying a grammar for the tree that is to be
> walked and action code to execute as the tree is walked. What your
> action code does is up to you.

Right, but the whole point of the tree grammar is to minimize
the amount of code that you have to write. What's the point
of embedding 30,000 lines of code inside a 350 line grammar,
if you could have just written 30,020 lines that do the same
thing?

> 
> > So in that case, the treewalker approach isn't buying you anything.
> 
> Time. You don't have to write the tree parsers by hand.

Only if a top-down walk of the tree is the basis of the
code that you need. If your code is something else (like
applying a series of pattern-matching rules), then
you have to write all the code even when using a tree grammar.

> 
> > > In your fourth change - we want to change the order of printing...and
> > > print the static block before the variable declarations, for example -
> > > you can simply employ multiple output writers. Buffer all static
> > > blocks in one writer, all variable declarations in another, etc. At
> > > the end, print the contents of the various writers in any order you
> > > wish. No need to modify the AST structure.
> > 
> > And isn't that quite a bit more involved than:
> >     print(getChild(ast, PACKAGE_DEF));
> >     printChildren(ast, "\n",  IMPORT);
> >     out.println();
> >     out.println();
> >     print(getChild(ast, CLASS_DEF));    // only one of these...
> >     print(getChild(ast, INTERFACE_DEF));  // will print anything
> >     out.println();
> 
> These seven lines of code don't matter much. It's the other 1000s of
> tree-walking lines that the tree-library approach forces you to write
> that matter. In the tree-grammar approach, ANTLR generates all that.

No, it doesn't generate any of that.
The only thing the tree grammar approach is giving you is that
it's walking the tree for you. You still have to write all the
code that does everything else.

> 
> Incidentally, the tree-grammar snippet is lucid and equally concise (4
> additional lines compared to the as-is print of the AST):
> 
> typeDefinition [PrintWriter output]
> { 
>   StringWriter classW  = new StringWriter();      //LINE 1
>   StringWriter ifaceW  = new StringWriter();      //LINE 2
> }
>   :  ( classDeclaration[classW]
>      | interfaceDeclaration[ifaceW]
>      )*
>      {
>        // swap to your heart's content
>        output.write(classW.toString());          //LINE 3
>        output.write(ifaceW.toString());          //LINE 4
>      }
>   ;

I'm sorry, I just don't think that's as clear.
Most of that's because I know Java much more than I know
ANTLR. But most people know Java (or C, C++, or C#) much
more than ANTLR.

> 
> > > 5. Limitations of TreeWalkers
> > > =============================
> > > 
> > > I don't see a tree-grammar limitation here Andy. I may be wrong but,
> > > isn't the real problem that "a[3]" is an array access while "int[] x"
> > > is an array declaration? If I am right they would have different AST
> > > representations in any case.
> > 
> > No, the problem is that a C array declaration can take either form:
> > int a[3];
> > int[3] a;
> 
> And they mean the same thing right? Isn't the point of an AST to
> remove concrete syntax irregularities like this? Can't see why you
> need to remember which of the variants the source had originally since
> Java code generation isn't affected by it.

That's the point of ASTs in theory, but not in practice.
This cgram grammar does generate different ASTs
for these two inputs, for example.

> 
> > > In any case, I can't provide a tree-grammar equivalent of your
> > > sample without knowing what the child of the ARRAY_DECLARATOR is
> > > when it isn't an EXPR. In other words, without knowing a bit more
> > > about your tree's grammar.
> > 
> > The child of an ARRAY_DECLARATION is always an EXPR.
> 
> But "int[] x" has no expression....

Yes it does, it's just the empty expression - an EXPR with
no children.
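
In other words, checking for it is trivial against the plain AST
interface. Here's a sketch using ANTLR 2's antlr.collections.AST;
hasExplicitSize is a made-up helper name:

import antlr.collections.AST;

class ArrayDeclChecks {
    // The child of an ARRAY_DECLARATION is always an EXPR; for "int[] x"
    // that EXPR simply has no children.
    static boolean hasExplicitSize(AST arrayDecl) {
        AST expr = arrayDecl.getFirstChild();  // the EXPR node
        return expr != null && expr.getFirstChild() != null;
    }
}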

> 
> > ...which highlights one of my major points: you need to understand
> > the AST structure whether you use a treewalker or not. I don't
> > even remember what the issue is here, but whatever it is, you're
> > about to go off and find an ARRAY_DECLARATION rule in an ANTLR
> > tree grammar and figure out how to stuff in some Java code.
> > I'd prefer to just be at a "case ARRAY_DECLARATION:" point
> > in some Java code, and just say:
> > 
> > AST expr = ast.getFirstChild();
> > doSomethingWithExpr(expr);
> 
> My generated tree-grammar does the same ultimately. I just don't have
> to write all the code manually.

Which code is it that you don't have to write because of the
tree grammar? Isn't it only some simple code that walks the
tree? I mean, isn't it little more than this:
void walk(AST ast) {
   doSomeAction(ast);
   // walk the children via ANTLR's first-child/next-sibling links
   for (AST child = ast.getFirstChild(); child != null; child = child.getNextSibling()) {
     walk(child);
   }
}



> 
> > > 6. Contrasting the approaches
> > > =============================
> > > 
> > > 1. Code generation isn't magic. We all use it quite happily for
> > lexers
> > > and parsers for instance. The same benefits exist for tree
parsers.
> > 
> > We use ANTLR-like tools for lexers and parsers because the
> > code they generate is straightforward and generic. Given a
> > grammar, you know exactly what the lexer and parser code should look like.
> 
> So you are unfamiliar with tree parsers. Sounds like you could
> benefit from learning more about them.

I think I do understand tree parsers. I understand what code
they will generate. But the code they generate is not
the best basis to build a translation app on top of.

A lexer is just a lexer: the ANTLR-generated lexer does its
job, and your app can just deal with its output (a token stream).
A parser does its job so that you can deal with its output
(an AST). A treewalker is not providing a clear "output"
like a lexer or parser, it's just providing a framework to
automatically walk a tree, for you to embed your actions within.
My main point is that any sufficiently complex
translator will not have a top-down tree walk as its
underlying framework.

> 
> > However, with AST-to-AST transformation, it's not at all clear
> > what the functionality needs to be. For example, given a "C source
> > AST" to "Java source AST", we would all come up with different
> > algorithms to do that transformation.
> 
> As we could about the structure of the AST we build in the parser. Or
> whether to build an AST at all. Or the names we give to our tokens in
> the lexer and how many there are. Or whether to use two lexers and a
> parser or just one of each etc....

Not really. Given the problem of lexing a single C source
file, we'd all choose the same solution: one lexer (generated from
some nice list of tokens). Given the problem of parsing a
single token-stream into an AST, we'd all choose the same solution:
a parser (generated from some BNF-like input grammar).

But given a single AST that represents a C program, some
would choose a treewalker to change it to a "Java AST", and
others would not.

> 
> > We would all end up with
> > a set of "rules" like "Find all FUNCTION_DEF nodes with an IDENT
> > child with text 'main' and a PARAMS_DEF child that has two children,
> > the first of which has a TYPE node with type 'int' ..."
> > Does a symbol table help us with finding such a node?
> 
> Nodes are part of the AST. Symbol table stores node *attributes*.

So then your answer is "no, a symbol table does not contain
that information". OK. And what if most of my app deals with
that type of information that's not available in a symbol
table?
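
Here's roughly what one of those searches looks like in plain Java
against the AST, just as a sketch. The token-type constants are
placeholders; in real code they'd come from the generated token types
file, and the class and method names are made up:

import antlr.collections.AST;

class MainMethodFinder {
    // Placeholder values; real ones come from the generated *TokenTypes interface.
    static final int FUNCTION_DEF = 1, IDENT = 2, PARAMS_DEF = 3;

    static boolean looksLikeMain(AST node) {
        if (node.getType() != FUNCTION_DEF) return false;
        AST ident = null, params = null;
        for (AST c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getType() == IDENT)      ident  = c;
            if (c.getType() == PARAMS_DEF) params = c;
        }
        if (ident == null || !"main".equals(ident.getText())) return false;

        int paramCount = 0;
        for (AST p = (params == null) ? null : params.getFirstChild();
             p != null; p = p.getNextSibling()) {
            paramCount++;
        }
        // Checking that the parameter types are 'int' and 'char *[]' (or
        // similar) would follow the same pattern of child inspection.
        return paramCount == 2;
    }
}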

> 
> > No. Which is easier: to extend the symbol table to include
> > that information, or write a tree-search library to find it?
> 
> Neither. Letting ANTLR write the tree-search library for you is what
> I'll recommend.

Whoa...I have no idea how that would work, but sounds interesting.

> 
> > I think the library approach is easier, especially if we can use
> > a preexisting nice, standard tree-search-library out there.
> 
> As Loring pointed out, trees for different apps are likely to be very
> different indeed. The ANTLR or TreeDL approach of code generation is
> likely - is proven actually - to be very much more successful (and
> easier to use/reuse) than your generic library approach.

Can you give me some examples of uses of ANTLR treewalkers 
to do complex translations? Someone else mentioned ASPA,
which I'm investigating now.

> 
> > > 2. It can be argued that the tree-grammar approach actually
> > > encourages your "keep related code together" maxim. All the code
> > > (excluding library code) that applies to a phase of your
> > > transformation is kept together in one file - the grammar file.
> > 
> > No, you're going to need multiple passes on the tree.
> 
> I wrote (note the added emphasis): 
> 
> "All the code (excluding library code) that applies to a **phase** of
> your transformation is kept together in one file - the grammar file."

OK, well I'm saying that a typical "translation rule" is going
to contain multiple phases. And it's going to be very hard
to keep the code for each rule separate from the others when
you have multiple phases.

> 
> > For the "main method" example, I want all my code that handles
> > "main method C to Java" in one class. I don't want it
> > spread across one grammar that stores method and parameter info
> > in a symbol table, a second grammar that makes the change, and
> > a third grammar that makes changes to any "main" function calls.
> 
> Sure, and you can build a framework that allows you to do that on top
> of whatever approach you choose - TreeDL+visitors, ANTLR+tree-parsers
> or even ad-hoc as you are doing.
> 
> > > 3. Complex tools require training investment before you can use them
> > > effectively. Having invested in training to be able to use ANTLR
> > > notation for specifying lexers and parsers, tree parsers are the
> > > next logical step.
> > 
> > But I'm able to use the ANTLR lexer and parser without any real
> > training. I should be able to *use* ANTLR without really knowing
> > much about ANTLR grammars.
> 
> Did you require the same of Java? To be able to use it without
> knowing much about the language or its libraries.

Yes! I use many programs that are written in Java, C, and many
other languages without ever seeing their source code.
I use the Java libraries without understanding their internals.

Are you saying it's not reasonable for me to want to be
able to deal with ASTs without having to use ANTLR syntax?
Shouldn't I be able to just swap out ANTLR and plug in
lex/yacc or some other AST-generating tool (not that I
would, of course :)?

> 
> > I just want ANTLR to lex and parse
> > C source and pass me the AST, and I'll take it from there.
> 
> Why use an AST at all? Or indeed a generated lexer or parser? With a
> file and string processing library, I can do all the stuff that the
> lexer/parser/AST enables without ever seeing a tree node. It would be
> messy but it would be all Java and I probably won't even have to learn
> Java. ;-)

Heh, I know you're joking, but that's almost exactly what I
actually spent the last two years doing!
I built the lexer with ANTLR, and I use an ANTLR-generated parser
for expression processing, but the rest is pure Java.
The approach does have drawbacks, but I'm convinced that it
was the right decision.

I've got more source-to-source translation functionality
than I've seen in any other tool, by an order of magnitude.
(Feel free to show me one that has more, of course :)

> 
> > > 4. The comments about debugging apply equally to generated tree
> > > parsers and hand-written tree walkers (they are often very similar).
> > > The tree structure is the same in any case so print the AST as you
> > > please.
> > 
> > I'm not proposing "hand-written tree walkers" so much as
> > a "tree searching and manipulation library". My whole
> > point is that AST-to-AST translation is better done as a
> > rule-based pattern-matching scheme than a top-down AST-walking
> > scheme. Take a look at:
> > http://jazillian.com/how.html
> > And think about how you'd do all those things with a treewalker.
> > I'm certain it would be horrendous.
> 
> It's a lot of work but the code structure would be simple and easy to
> maintain.

Again, any examples of ANTLR treewalkers
that have that much functionality?
> 
> > > 5. ANTLR support for hetero-ASTs could be improved but nothing you
> > > have mentioned so far is beyond ANTLR 2.7.4's hetero-AST support.
> > > Can you give an example where this is important?
> > 
> > I'm not sure what you mean. The ANTLR treewalker approach basically
> > says "embed your actions at each node of the grammar". That's
> > fine for certain apps, especially compilers, interpreters,
> > and code-annotators. But I don't see how it would work for
> > a complex source-to-source translator, in which the rules
> > for transformation are complex.
> 
> That's how these apps are generally written nonetheless.
> 
> > > 7. Summary
> > > ==========
> 
> > The problem is that I don't
> > think the tree-grammar approach gets too complicated until
> > you've got a pretty large system. I think the first 10% of
> > my C-to-Java translator would be fine using a treewalker.
> > But soon the symbol table would seem to be completely
> > inadequate and I'd end up with all sorts of AST-searching
> > code anyway.
> 
> I feel I should repeat that you probably would benefit from reviewing
> the literature on static code analysis techniques and implementations.
> That's what you are doing in an ad-hoc fashion. It will hurt
> eventually as you attempt more complex analysis. 

Sorry, I'm already "done" :)
I did read a lot about static code analysis techniques, and
most of it never seemed to get beyond the basics of building
symbol tables and trivial transformations like changing
variable names or simple refactoring.

> IOW, some of the
> issues you raise are really about tree-library vs tree-grammar IMHO.

That's what I'm trying for...that's what the thread's about.

Thanks,
Andy

> 
> Cheers,
> 
> Micheal
> ANTLR/C#





 