[antlr-interest] CommonTree & Tree grammar versus DIY

Fri Aug 22 11:28:17 PDT 2008

Terence Parr wrote:
> 
> On Aug 21, 2008, at 3:33 PM, Andy Tripp wrote:
>>> Apparently you've never made a cycle in a tree before ;)
>>> Recursive/self-similar data structures *ARE* more difficult to alter
>>> properly than a List; by hand anyway.
>> Hmm. I guess I just don't see it.
>> A List of Person objects doesn't get much harder when the Person class
>> has a "Person mom" field. Sure, you can accidentally create a cycle, but
>> you can also accidentally make the same mistake with a List (e.g.
>> having a "next" pointer in a linked list point somewhere invalid, like
>> to itself).
> 
> OnlyIf you have not found a library to implement List, right?

Right, I'm assuming we have a library implementation of List and Tree.
And I do appreciate your point that it's hard (or impossible) to mess up
a List while using a working implementation of it, but you can easily mess up a 
Tree while using a working implementation of it.

> 
>>> For lang X to X, morphing trees
>>> is ok but altering language makes each new pass have to match a new kind
>>> of tree.
>>
>> No, it doesn't. There is no "new kind of tree" -
> 
> well, it was assumed you knew I meant different structure. obviously, I 
> always make a single node class so all nodes are always of the same type.

So a tree has a "different structure" when you add, delete, or change a node?
Then a list also has a "different structure" when you add, delete, or change
an item. I wouldn't use that terminology - I'd say each is just a data 
structure, which of course has its contents change over time.

I feel like you're saying a tree "changes structure" when its data changes,
whereas you wouldn't say the same for a list (for example). And I think you
say that because you're used to a tool (ANTLR treewalker) where you actually
specify a trees "exact structure". But you don't have a similar tool for a
list.

Suppose I worked every day with a tool where I specify a List "structure"
like this:
     FAMILY = { MOM [DAD] CHILDREN* }
...and then I said "it's inherently dangerous to change the structure of
a list - e.g. moving the DAD item to be the first item in the list".
You'd look at me funny and say "the list is just a data structure, Andy.
Having its contents change is what it *does* for a living. There's
nothing dangerous about it."  I think that's what's happening here, but
with trees rather than lists.

> 
>> it stays a tree data structure
>> always. It's the ANTLR treewalker implementation that makes it seem 
>> like there's
>> "something new" going on.
> 
> Actually, you are trying to pretend that there is no new structure. But, 
> as you have pointed out to me many times you have to do all sorts of 
> weird matching to try the test for the new possibilities. This is a new 
> structure. Yes, you are parsing the same kind of raw elements, but the 
> structure is different. Structure imparts meaning. If this were not 
> true, we would not parse at all. We would simply do "while more tokens, 
> consume". That is not recognition or it without recognition you can do 
> nothing. If you change the language, you must change the code that does 
> the recognition. I believe you simply didn't get my assumption.
> 
>> The "shape" of the tree is changing, just as any
>> other data structure's "shape" changes over time.
> 
> Oh, so you did make that assumption. strange.

Yes, tree nodes contents change and new ones are added and deleted.
Just as for a list. 

> 
>> It's the ANTLR treewalker
>> forcing you to provide a "snapshot" of that evolving shape that's making
>> it seem like the tree is somehow now of "a new kind of tree".
> 
> Yes. How could it be otherwise, if you are walking the entire tree? 

Easy. Just don't use a treewalker, and do it by hand.
For example, my Java pretty printer
(http://www.jazillian.com/antlr/emitter.html)
processes the tree, without ever depending on the tree's overall structure.
Each node type is handled in isolation.
There is no "snapshot" of the "AST shape". 
It does not know or care that at the top level, a Java file contains a class
or interface.

> If 
> you choose to parse only those parts of the tree that have not changed, 
> then you don't have to change your recognition rules or rewrite rules in 
> your case. In your case, you have to make many many passes over the 
> input so you probably just ignore what has been rewritten if it is 
> written to a new language. 

> If you need to parse things that have been 
> transformed from one language to the other, how do you propose to 
> recognize them without rules to handle the new constructs?  

You need "rules", I just don't want those "rules" to have to know the 
overall tree structure. That's how my translator works (though with
token streams, not ASTs, but none of my "rules" depend on the overall
structure of a C, Java, or C/Java hybrid program).

> Andy, I 
> respect the work you have done immensely in rewriting, but I just never 
> understand your perspective on tree walking.

Let me try again then. Suppose in language "FROM" you can declare multiple variables:
int i, j;
...but in language "TO", you can't:
int i;
int j;

Let's start with the simple case: The FROM and TO languages 
are completely identical except for this one case.
To build your simple translator, you should *not* have to specify the whole AST structure.
Instead, you should be able to call some existing FROM parser, get an AST, and then do
something like:

void convert(AST ast) {
  if (ast.type == DECL && ast.getNumChildren() > 1) {
    ast.getParent().removeChild(ast);    // remove the "int i, j;" node
    for (AST child: ast.getChildren()) { 
      AST newChild = new AST(ast.type);  // create new nodes for i and j
      newChild.setChildren(child);
      ast.getParent().addChild(newChild); // add "int i;" and "int j;" nodes
    }
  }
}

Here, we've written a single rule that doesn't depend on any "structure" of the tree
other than the tiny part of the tree that it cares about. And the only
"framework" or "scaffolding" needed is this:

void walk(AST ast) {
  convert(ast);
  for (AST child: ast.getChildren()) {
    walk(child);
  }
}

So we're done. We got this working without any knowledge of overall AST structure,
and very few lines of code in just a single language. With a treewalker, we'd
have to know the overall structure (the .g file), learn another language
(ANTLR), and we'd *still* end up either writing the convert() code above, or
doing the equivalent in ANTLR or StringTemplate language.

You might be thinking "but we already *have* the overall AST structure -
just start with the .g file from the parser and make a couple minor changes.
That's a bad approach, for the same reason that copying a 200-line method
and making a couple minor changes, is bad.

For the complex case, assume that FROM and TO are quite different, and we'll
need a couple hundred "rules" like this one. We sure don't want to specify
the "structure" of the AST at the start of each of these rules. 

There is one key assumption here: it won't work to have a single (or even a just a few)
passes. If we could do it all in a single pass, great - use a treewalker, and just
put a bit of code at each key place in the .g file. But in practice, there is not
just one or two things that need to be done at each node type, there are many.
Off the top of my head, translating C++ to Java, have a variable declaration
node type "DECL", we have to deal with:

* access modifiers ("public", etc) are at higher level in C++
* dealing with C++ "friend" affects access modifiers
* "const" may (or may not) need to change to "final"
* C++ primitives are not guaranteed to be initialized, Java's are.
* C++ non-primitives can have space allocated (Person p[20]), but not in Java
* Java compiler checks (using control flow analysis) for ininitialized variables
* possibly change C int type to Java boolean
* convert between C primitives and Java primitives (based on architecture)
* possibly replace "char*" with "String"
* if the declaration is a main() parameter ("argv"), you may do extra work
* C++ allows "static" within a method, Java doesn't

...and so on. That's just off the top of my head, there are probably 2-3 times that
many "rules" that involve a DECL node in an AST when doing C++ to Java.
So, in the treewalker approach,
somewhere in our C++.g, we'd have one place where we say
"while walking this tree, when we come to a DECL node, call my code to do all that work":

    varDef
    :   #(DECL modifiers typeSpec methodHead (slist)?)
        { handleAllCasesRelatedToVariableDeclarations(); }
    ;

And now we write the handleAllCasesRelatedToVariableDeclarations() method, which
is probably thousands of lines. Was there any advantage to having this call
embedded in a .g file? No. It would have worked just as nicely to just use
a walk() "framework" shown above, or the standard Visitor pattern. Was there a
disadvantage? Yes. While writing all that code, you're surely going to find
that you really want to have multiple passes (e.g. let's process all typedefs
in some other rule first, so I don't have to deal with them here).
And multiple passes are going to be a pain with a treewalker, because you've
got to specify a .g for each pass.

In summary: for the simple case, the treewalker was overkill, making us specify
the entire AST structure when all we cared about was a tiny portion of it.
For the complex case, the treewalker was no help, providing a "jumping off point"
to the code that does all the work. As a "framework", it has no advantage
over the simpler walk() or Visitor Pattern approach, and has the disadvantage of
again relying on the overall AST structure. Plus, specifying a .g file
for each pass when we have many passes will be a real pain.

I hope that helps you see where I'm coming from.

I don't mean to put down your ANTLR implementation of treewalkers, or their use in general.
I'm mainly saying to those newbies who post "how do I change my AST?" the simple answer:
it's a tree data structure - do what you were taught in CS101. And to anyone
considering a full translator from one high-level language to another: beware
the treewalker approach.

Thanks for listening and your support.
There is a slight chance that I'm making a valid point here somewhere :)
Andy

p.s. ...in my opinion, Jim ;)