[antlr-interest] Please help me with homogeneous versus heterogeneous trees!

Thu Jul 28 10:46:24 PDT 2005

Really good Antlr people,

I am struggling with the design decision of whether to use a heterogeneous vs homogeneous AST. I would love some advice from people who've been there and done that.

First, let me make this clear ... I agree 100% with what Terrance says about not putting *actions* into strongly typed node objects (see http://www.antlr.org/blog/antlr3/trees.tml and search for "Heterogeneous Nodes"). If you do that, and if you need more than one pass, then you've got code for 5 different passes all sliced up and scattered around each of your node classes. Yuck. So I am not asking about that aspect of the decision.

But I am struggling with the concept of *naming the relationships* between nodes. After all, the AST is supposed to be the "canonical representation" of what the source code says, right? It is meant to be simplified, robust against grammar changes, and convenient to understand (right?). So, if I think about how I'd design a tree with all that in mind, I start to picture nodes with named relationships between them. But that starts taking me in the direction of heterogeneous trees. 

Here's an example. Say I parse a SQL SELECT statement (which I happen to be doing). I use Antlr's great tree operators, and a tree grammar to do one pass, and here's part of the tree that I wind up with:

     SELECT
       |
       |
     FIELD ("sls_dt")
     |       |
     |       |
    ID("a")  ID("Sales_History")

As you can see, a FIELD node has an alias (that's what the first child is) and a table name (that's what the second child is). Now, here's the question. I can look at the tree, and using my own knowledge of SQL, I can see that the first child is the alias and the second child is the table name. But the tree doesn't *say* that.

But shouldn't a "canonical representation" of the meaning of the input code go ahead and spell such things out? In other words, can we say that the next tree is objectively better for most purposes?

     SELECT
       |
       |
     FIELD ("sls_dt")
     |       |
     |       |
   ALIAS   TABLENAME
     |       |
     |       |
    ID("a")  ID("Sales_History")

And, if that is an improvement, then the next question is, why don't I introduce a specific node class, called FieldNode, which has a method called getAlias() and another method called getTableName()? Then, my node class takes care of naming the relationships for me. Then, I don't need the 2 imaginary nodes.

I've seen this approach discussed in detail in books by John Gough and Pat Terry, both of whom have taught me a lot through their writings. But I haven't been able to find a detailed discussion of the *other* (all homogeneous) approach.

How have you long-time antlr users come to approach this, and why? And any pointers to materials that discuss this topic would be greatly appreciated.

Thanks!
Charlie Flowers 
Atlanta, GA