[antlr-interest] Re: Translators Should Use Tree Grammars

micheal_jor open.zone at virgin.net
Mon Nov 22 18:08:35 PST 2004



--- In antlr-interest at yahoogroups.com, "atripp54321" <atripp at c...> wrote:

> > > Right. My problem is that the amount of information I feel I'd
> > > need in the symbol table is HUGE. Some examples of info I need:
> > > 
> > > "Is this variable ever used in an expression where I could change
> > > it to from an int type to a boolean type?"
> > > 
> > > "Is this variable v ever used later in this function in either
> > > 'v = x;' form or 'if (v)' form?"
> > > 
> > > "Is this method named 'main' and have two arguments, one of 'int'
> > > type and the second of 'char *[]', 'char **', or similar type?"
> > > 
> > > "What other variables are assigned a value in which this variable
> > > is part of the value?"

<SNIP>

> I know what a symbol table is, thank you.

Sorry.

> I've just made the point that I don't think that a symbol
> table typically contains the sort of information that I
> just gave four examples of.

No, it doesn't. Some of those queries can be answered directly with a
symbol table lookup. Others can only be answered after performing some
form of [static] semantic analysis, e.g. type checking or type inference.

This analysis can be implemented with tree-parsers and may involve
updates to existing node attributes or the creation of additional node
attributes. Some of these attributes may be stored in a symbol table
rather than directly in the AST.
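
To make that concrete, an analysis pass might attach inferred
attributes to each declared symbol. A minimal sketch in Java (the
class and field names are invented for illustration, not from any
particular implementation):

  // Hypothetical sketch: a symbol-table entry whose attributes are
  // filled in by later analysis passes rather than by the parser.
  import java.util.ArrayList;
  import java.util.List;

  class SymbolEntry {
      final String name;
      String declaredType;        // e.g. "int", recorded at the declaration
      boolean usedOnlyAsBoolean;  // refined by a pass over every use site
      List uses = new ArrayList();  // AST nodes that reference the symbol

      SymbolEntry(String name, String declaredType) {
          this.name = name;
          this.declaredType = declaredType;
          this.usedOnlyAsBoolean = true;  // refuted when a non-boolean use is found
      }
  }

A pass that walks every use of the symbol can then answer queries like
"could this int become a boolean?" by inspecting the accumulated
attributes instead of re-searching the AST each time.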

> > Your particular usage - a C-to-Java translator - would require no
> > more than is typically stored in a symbol table.
> 
> So you believe that a symbol table typically contains answers
> to the kinds of questions I list above?

No. I said your application only requires a "typical" symbol table.
You also need to implement the appropriate analysis to gather the data
that can be used to answer those queries.

> > You are performing static
> > code analysis and there are established techniques and literature on
> > the subject...
> 
> Yes, and I'm saying that I've tried the established techniques
> and found them severely lacking.

You may be right, but perhaps you haven't tried the right ones?

> > > You're probably thinking that these would not go into the symbol
> > > table, I would just have to write AST-searching code for that.
> > > My point is that by the time I've got this huge library of
> > > AST-searching code to do these things, the symbol table is
> > > superfluous.

<SNIP>

> I don't think you're addressing what I said: I'm saying that
> if I make my symbol table so complicated that it contains the
> information that I need (e.g. answers to those questions above,
> plus dozens of others), then the symbol table has so
> much functionality that it's basically a huge library
> containing most of the application's functionality.

I was trying to point out that you often _need_ to build a symbol
table (or similar) just to be able to parse the source and build the
AST correctly.

The AST searching can be done by specifying patterns to match in your
tree-grammar. AST transformations can be similarly accomplished.

> > > > 2a. Keeping the TreeStack and SymbolTable in sync with the AST
> > > > ==============================================================
> > 
> > > No, we're not just talking about renaming local variables,
> > > we're talking about hundreds of different transformations,
> > > of which renaming local variables is just one example.
> > 
> > Fine. Provide "hooks" into your processing flow and plug in your
> > "transformation objects". The rename point above might be one place
> to
> > provide a hook for rules that deal with local variables.
> 
> OK, and my app would best be served by having each node's
> action be the same: check to see if this node in the AST
> matches any of my "transformation rules", and do the
> corresponding transformation. And of course, at that point,
> I'm not really using the power of the treewalker approach.

My suggestion was to use the tree-grammar approach for searching and
to leave the transformations to "transformation objects" rather than the
tree parser. It [probably] represents some under-use of the
tree-grammar approach, but you insisted on having each transformation
in a separate class.
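
To sketch what I mean by hooks and "transformation objects" (the
interface and class names here are invented, not from any existing
library):

  // Hypothetical sketch: each transformation lives in its own class and is
  // offered every node that a hook in the tree grammar passes to the registry.
  import antlr.collections.AST;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  interface TransformationRule {
      boolean matches(AST node);   // does this rule apply at this node?
      void apply(AST node);        // rewrite the subtree in place
  }

  class RuleRegistry {
      private final List rules = new ArrayList();

      void register(TransformationRule rule) { rules.add(rule); }

      // Called from an action (a "hook") inside the tree grammar.
      void fireAll(AST node) {
          for (Iterator i = rules.iterator(); i.hasNext(); ) {
              TransformationRule r = (TransformationRule) i.next();
              if (r.matches(node)) {
                  r.apply(node);
              }
          }
      }
  }

Each "local variable" rule would then be one TransformationRule
implementation, and the tree grammar only decides where fireAll() gets
called.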

> > > Well, there may be dozens of "rules" or "transformations" that
> > > apply to global variables. I don't want one section of
> > > code where global variable declarations are handled, with
> > > bits and pieces of each of these "rules" embedded there.
> > 
> > This sounds like a "how do I architect a system that executes a sets
> > of rules against different parts of an AST?" question to me rather
> > than a tree-library vs tree-grammar approach question.
> 
> Yes, that's the question exactly. And the answer is either
> "use a tree-library" or "use a tree-grammar".

Not really. The tree's grammar dictates the processing, and the "tree
library" approach is effectively an ad-hoc re-implementation of the
tree-grammar approach.

Suppose you wanted to support user-defined transformation rules? You
would need to build a framework on top of the tree-library/tree-grammar
infrastructure, perhaps by exposing your low-level AST searching and
manipulation functionality via an embedded DSL - a bit like the TXL
approach. That's neither a tree-library nor a tree-grammar question.

> > > > 2b. Do you want methods in AST or TreeStack?
> > > > ============================================

> > > Objection, your honor! The defense is mixing ANTLR grammar
> > > with Java code again! Sorry :) Yo just cringe cada vez
> > > Veo languages mezclado together. It's very hard to read
> > > if you don't know both languages.
> > 
> > Do you similarly cringe when you use ANTLR to build lexers and
> > parsers? Is there some reason why you feel tree-parsers are any
> > different?
> 
> No, because there's no mixing of code there, it's all ANTLR.
> And besides, I can (pretty much) use ANTLR without understanding
> the grammar syntax - just use the .g files that came with it.

Most production-level ANTLR grammars have a mix of ANTLR directives
and action code. Incidentally, I don't recommend using a tool without
understanding it.

> > > > 2c. Sometimes you need multiple passes
> > > > ======================================

> > So you need an updatable symbol table? Build one. Provide delete()
> > and perhaps rename() methods where they're needed and call them as
> > appropriate during your transformations. Don't see a problem here,
> > sorry.
> 
> By the time the symbol table contains all the functionality
> I need, it's no longer a "symbol table" (at least not like
> one I've ever heard of), it's a "tree library".

The attributes in the symbol table are accessible from the nodes they
"belong to". Beyond that, there usually isn't much "tree handling or
awareness" in symbol tables.

Symbol tables are usually built for fast querying of node attributes
only and are often not updatable (there is no need since source files
don't change during compilation). You can make them updatable if your
application needs it.
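
A minimal sketch of such an updatable table, following the delete()
and rename() suggestion above (everything else here is invented for
illustration, and scoping is omitted for brevity):

  // Hypothetical sketch: an updatable symbol table keyed by name.
  import java.util.HashMap;
  import java.util.Map;

  class SymbolTable {
      private final Map symbols = new HashMap();  // name -> entry object

      void define(String name, Object entry) { symbols.put(name, entry); }
      Object lookup(String name)             { return symbols.get(name); }
      void delete(String name)               { symbols.remove(name); }

      // Re-key an entry when a transformation renames the symbol.
      void rename(String oldName, String newName) {
          Object entry = symbols.remove(oldName);
          if (entry != null) {
              symbols.put(newName, entry);
          }
      }
  }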

> > > > 4. Comparing approaches by analyzing ease of change
> > > > ===================================================

> Right, but the whole point of the tree grammar is to minimize
> the amount of code that you have to write. What's the point
> of embedding 30,000 lines of code inside a 350 line grammar,
> if you could have just written 30,020 lines that do the same
> thing?

A little exaggerated, no doubt. I should point out that tree parsers
don't just walk the AST. They can pattern match within the AST and
apply *transformations* too. The action code would typically be for
semantic checks to determine appropriate transformations, but it can
be [almost] anything you wish.

> > > So in that case, the treewalker approach isn't buying you anything.
> > 
> > Time. You don't have to write the tree parsers by hand.
> 
> Only if a top-down walk of the tree is the basis of the
> code that you need. If your code is something else (like
> applying a series of pattern-matching rules), then
> you have to write all the code even when using a tree grammar.

No. Tree parsers can pattern match using AST subtree patterns
specified in the grammar. They can also perform AST transformations.
But you'd need to learn ANTLR's tree parser syntax for that.

> > > And isn't that quite a bit more involved than:
> > >     print(getChild(ast, PACKAGE_DEF));
> > >     printChildren(ast, "\n",  IMPORT);
> > >     out.println();
> > >     out.println();
> > >     print(getChild(ast, CLASS_DEF));    // only one of these...
> > >     print(getChild(ast, INTERFACE_DEF));  // will print anything
> > >     out.println();
> > 
> > These seven lines of code don't matter much. It's the other 1000s of
> > tree-walking lines that the tree-library approach forces you to write
> > that matter. In the tree-grammar approach, ANTLR generates all that.
> 
> No, it doesn't generate any of that.
> The only thing the tree grammar approach is giving you is that
> it's walking the tree for you. You still have to write all the
> code that does everything else.

I've already pointed out that tree parsers do more than just walk the
tree. They also search, replace, create and delete subtrees too. They
can transform ASTs.

> > Incidentally, the tree-grammar snippet is lucid and equally concise
> > (4 additional lines compared to the as-is print of the AST):
> > 
> > typeDefinition [PrintWriter output]
> > { 
> >   StringWriter classW  = new StringWriter();      //LINE 1
> >   StringWriter ifaceW  = new StringWriter();      //LINE 2
> > }
> >   :  ( classDeclaration[classW]
> >      | interfaceDeclaration[ifaceW]
> >      )*
> >      {
> >        // swap to your heart's content
> >        output.write(classW.toString());         //LINE 3
> >        output.write(ifaceW.toString());         //LINE 4
> >      }
> >   ;
> 
> I'm sorry, I just don't think that's as clear.
> Most of that's because I know Java much more than I know
> ANTLR. But most people know Java (or C, C++, or C#) much
> more than ANTLR.

Most people are not using ANTLR to build their systems. If they were,
they'd [better] know ANTLR.

> > > > 5. Limitations of TreeWalkers
> > > > =============================
> > > > 
> > > > isn't the real problem that "a[3]" is an array access while
> > > > "int[] x" is an array declaration?
> > > 
> > > No, the problem is that a C array declaration can take either form:
> > > int a[3];
> > > int[3] a;
> > 
> > And they mean the same thing, right? Isn't the point of an AST to
> > remove concrete syntax irregularities like this? Can't see why you
> > need to remember which of the variants the source had originally,
> > since Java code generation isn't affected by it.
> 
> That's the point of ASTs in theory, but not in practice.
> This cgram grammar does generate different ASTs
> for these two inputs, for example.

The cgram grammar was intended for source-to-source translation of C
programs (if memory serves). In that case it's easy to understand why
Monty chose to generate a pseudo-AST - the abstract syntax of C with
some concrete syntax left in for faithful-ish reproduction of the
original source.

For a C-to-Java translator, you could [should?] probably have removed
this particular concreteness from the pseudo-AST.

> > > The child of an ARRAY_DECLARATION is always an EXPR.
> > 
> > But "int[] x" has no expression....
> 
> Yes it does, it's just the empty expression - an EXPR with
> no children.

The empty expression? Well, writing a rule to differentiate this case
from the other seems easy enough - check whether the EXPR has children.
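
With ANTLR's AST interface that check is tiny. A sketch (the class
name is invented and the token naming follows the discussion above;
this is not from cgram itself):

  // Sketch: distinguish "int a[3]" from "int[] x" by whether the
  // EXPR child of the array declaration has any children of its own.
  import antlr.collections.AST;

  class ArrayDeclChecks {
      // Returns true for the "int a[3]" form (EXPR carries the size),
      // false for the "int[] x" form (EXPR is empty).
      static boolean hasExplicitSize(AST exprNode) {
          return exprNode.getFirstChild() != null;
      }
  }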

> > My generated tree-grammar does the same ultimately. I just don't
> > have to write all the code manually.
> 
> Which code is it that you don't have to write because of the
> tree grammar? Isn't it only some simple code that walks the
> tree? I mean, isn't it little more than this:
> void walk(AST ast) {
>    doSomeAction(ast);
>    Iterator i = ast.getChildren().iterator();
>    while (i.hasNext()) {
>      AST child = (AST) i.next();
>      walk(child);
>    }
> }

It can be a lot more than an AST traversal. It depends on what I put
into my grammar. See my earlier comments.

> > > > 6. Contrasting the approaches
> > > > =============================
> > > > 
> > > > 1. Code generation isn't magic. We all use it quite happily for lexers
> > > > and parsers for instance. The same benefits exist for tree parsers.
> > > 
> > > We use ANTLR-like tools for lexers and parsers because the
> > > code they generate is straightforward and generic. Given a grammar,
> > > you know exactly what the lexer and parser code should look like.
> > 
> > So you are unfamiliar with tree parsers. Sounds like you could
> > benefit from learning more about them.
> 
> I think I do understand tree parsers. I understand what code
> they will generate. But the code they generate is not
> the best basis to build a translation app on top of.
> 
> A lexer is just a lexer: the ANTLR-generated lexer does its
> job, and your app can just deal with its output (a token stream).
> A parser does its job so that you can deal with its output
> (an AST). A treewalker is not providing a clear "output"
> like a lexer or parser, it's just providing a framework to
> automatically walk a tree, for you to embed your actions within.
> My main point is that any sufficiently complex
> translator will not have a top-down tree walk as its
> underlying framework.

Both the input and the output of a tree parser (tree-parser !=
tree-walker) are ASTs. It is an AST-to-AST transformer. In the
simplest case (the tree walker) it simply visits each node of the AST
in depth-first order. The input and output ASTs are structurally the same.

In more complex cases, the AST would be transformed and the input and
output ASTs would be structurally different.

> > > However, with AST-to-AST transformation, it's not at all clear
> > > what the functionality needs to be. For example, given a "C source
> > > AST" to "Java source AST", we would all come up with different
> > > algorithms to do that transformation.
> > 
> > As we could about the structure of the AST we build in the parser.
> > Or whether to build an AST at all. Or the names we give to our tokens in the
> > lexer and how many there are. Or whether to use two lexers and a
> > parser or just one of each, etc.
> 
> Not really. Given the problem of lexing a single C source
> file, we'd all choose the same solution: one lexer (generated from
> some nice list of tokens). Given the problem of parsing a
> single token-stream into an AST, we'd all choose the same solution:
> a parser (generated from some BNF-like input grammar).

Not at all. I might be able to implement my system with the string
handling shenanigans of sed/awk/perl etc. The problem isn't ever
"lexing" or "parsing". It's building a source-to-source translator, or
computing source code metrics, etc.

> But given a single AST that represents a C program, some
> would choose a treewalker to change it to a "Java AST", and
> others would not.

It's rather that some would use a tool to generate the "tree walking,
searching & transforming code" that does the translation. Others would
choose to write that "tree walking, searching & transforming code" by
hand as "tree library" code.

> > > We would all end up with
> > > a set of "rules" like "Find all FUNCTION_DEF nodes with an IDENT
> > > child with text 'main' and a PARAMS_DEF child that has two children,
> > > the first of which has a TYPE node with type 'int' ...
> > > Does a symbol table help us with finding such a node?
> > 
> > Nodes are part of the AST. Symbol table stores node *attributes*.
> 
> So then your answer is "no, a symbol table does not contain
> that information". OK. And what if most of my app deals with
> that type of information that's not available in a symbol
> table?

Your example rule is searching for "nodes" (actually AST subtree
patterns). You would need to search the AST and not the symbol table.
The node's "type" attribute might be stored in the symbol table.


> > > I think the library approach is easier, especially if we can use
> > > a preexisting nice, standard tree-search-library out there.
> > 
> > As Loring pointed out, trees for different apps are likely to be very
> > different indeed. The ANTLR or TreeDL approach of code generation is
> > likely - proven, actually - to be very much more successful (and
> > easier to use/reuse) than your generic library approach.
> 
> Can you give me some examples of uses of ANTLR treewalkers 
> to do complex translations? Someone else mentioned ASPA,
> which I'm investigating now.

ASPA is a good example. 

This [student] project description might help too:
  http://www.cc.gatech.edu/classes/AY2001/cs4240_fall/prj2/prj2.html

> OK, well I'm saying that a typical "translation rule" is going
> to contain multiple phases. And it's going to be very hard
> to keep the code for each rule separate from the others when
> you have multiple phases.

You might want to consider separating analysis phases (that gather the
info you need to make your synthesis decisions) from synthesis phases
(that perform the actual transformations).
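
As a sketch of that separation (the interface and class names are
invented; the shared fact store is just a Map here for brevity):

  // Hypothetical sketch: run every analysis pass first to gather facts,
  // then let the synthesis passes use those facts to transform the AST.
  import antlr.collections.AST;
  import java.util.Iterator;
  import java.util.List;
  import java.util.Map;

  interface AnalysisPass  { void analyze(AST root, Map facts); }
  interface SynthesisPass { void transform(AST root, Map facts); }

  class TranslationPipeline {
      private final List analyses;   // elements are AnalysisPass
      private final List syntheses;  // elements are SynthesisPass

      TranslationPipeline(List analyses, List syntheses) {
          this.analyses = analyses;
          this.syntheses = syntheses;
      }

      void run(AST root, Map facts) {
          for (Iterator i = analyses.iterator(); i.hasNext(); ) {
              ((AnalysisPass) i.next()).analyze(root, facts);
          }
          for (Iterator i = syntheses.iterator(); i.hasNext(); ) {
              ((SynthesisPass) i.next()).transform(root, facts);
          }
      }
  }

That keeps each rule's decision logic out of the code that actually
rewrites the tree.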

> > > But I'm able to use the ANTLR lexer and parser without any real
> > > training. I should be able to *use* ANTLR without really knowing
> > > much about ANTLR grammars.
> > 
> > Did you require the same of Java? To be able to use it without
> > knowing much about the language or its libraries.
> 
> Yes! I use many programs that are written in Java, C, and many
> other languages without ever seeing their source code.
> I use the Java libraries without understanding their internals.

Cute, but can you use Java without understanding the Java language and
its semantics, or knowing anything about its libraries? I don't mean the
implementation of Java compilers, VMs and the source of the libraries
(although they might help for very advanced systems).

> Are you saying it's not reasonable for me to want to be
> able to deal with ASTs without having to use ANTLR syntax?
> Shouldn't I be able to just swap out ANTLR and plug in
> lex/yacc or some other AST-generating tool (not that I
> would, of course :)?

No. We are comparing writing your own AST processing code manually to
having a tool generate it for you based on the AST's grammar.

> > > I just want ANTLR to lex and parse
> > > C source and pass me the AST, and I'll take it from there.
> > 
> > Why use an AST at all? Or indeed a generated lexer or parser?
> > With a file and string processing library, I can do all the stuff that the
> > lexer/parser/AST enables without ever seeing a tree node. It would be
> > messy but it would be all Java and I probably won't even have to
> > learn Java. ;-)
> 
> Heh, I know you're joking, but that's almost exactly what I
> actually spent the last two years doing!

;-)

> I built the lexer with ANTLR, and I use an ANTLR-generated parser
> for expression processing, but the rest is pure Java.
> The approach does have drawbacks, but I'm convinced that it
> was the right decision.

But was it the "best" decision from an engineering pov?. 
Could it have been done quicker?. 
Can the system scale to handle ever more complex analysis?. 
Can the system translate millions of lines of C in a short time using
reasonable machine resources? (compared to competitors) 
How easily could the system be modified to do say Ada-to-Java for
instance?. How long would that change take?

> I've got more source-to-source translation functionality
> than I've seen in any other tool, by an order of magnitude.
> (Feel free to show me one that has more, of course :)

The C-to-Java translation tools market is foreign to me ;-)

> > > I'm not proposing "hand-written tree walkers" so much as
> > > a "tree searching and manipulation library". My whole
> > > point is that AST-to-AST translation is better done as a
> > > rule-based pattern-matching scheme than a top-down AST-walking
> > > scheme. Take a look at:
> > > http://jazillian.com/how.html
> > > And think about how you'd do all those things with a treewalker.
> > > I'm certain it would be horrendous.
> > 
> > It's a lot of work but the code structure would be simple and easy
> > to maintain.
> 
> Again, any examples of ANTLR treewalkers that 
> have that much functionality?

I don't know of any open source example. Do Ter's multi-pass
translators and Monty's AREV Basic work count?

> > > > 7. Summary
> > > > ==========

> > I feel I should repeat that you probably would benefit from reviewing
> > the literature on static code analysis techniques and implementations.
> > That's what you are doing in an ad-hoc fashion. It will hurt
> > eventually as you attempt more complex analysis.
> 
> Sorry, I'm already "done" :)
> I did read a lot about static code analysis techniques, and
> most of it never seemed to get beyond the basics of building
> symbol tables and trivial transformations like changing
> variable names or simple refactoring.

Even lint/splint etc.? Compaq's ESC? You *are* performing static
semantic analysis and using the results to inform decisions about
which AST transformations to perform.

> > IOW, some of the
> > issues you raise are really about tree-library vs tree-grammar IMHO.
> 
> That's what I'm trying for...that's what the thread's about.

I meant they **aren't** really about tree-library vs tree-grammar. I
posted a correction.

Cheers,
 
Micheal
ANTLR/C#






 