[antlr-interest] AST Node data

Tue Jul 29 11:20:39 PDT 2003

I was just reworking some of my heterogeneous AST types in C++ and I noticed
certain patterns that I thought were interesting.  I mentioning this for the
benefit of those people working on ANTLR 3 and/or reworking C++ AST support.

In building up my nodes during parsing, I occasionally store certain
information away directly in each node rather than using other nodes to
capture it.  For example, if I have a declaration rule that says:

declaration
  : (qualifiers)* type_name:IDENT var_name:IDENT ";"
  ;

I will suppress the generation of AST nodes for "type" and "var".  I do this
because there is nothing inherently dynamic about these two pieces of
information since they are always there and their type is known (so why use
an abstract tree to store them?).  So with tree construction markup, my rule
would then look something like:

declaration
  : (qualifiers)* type:IDENT! var:IDENT! ";"!
    {
       ##->p_typeName = #type->getText();
       ##->p_varName = #var->getText();
    }
  ;

Note the use of the "p_" prefix.  The idea here is that this is information
gleaned purely from parsing (no semantics associated with it).

OK, that isn't very interesting (and perhaps its even misguided :-), but the
interesting part is the other stuff on the node.  I call this "meta data"
although there is probably a more appropriate term.  This is information
that represents relationships BETWEEN NODES.  The obvious child, sibling
stuff is there, but this only covers the base relationships.  The other
"meta_data" goes beyond these simple relationships.

For example, as I run through the parsing (or possibly in a later pass with
a tree parser), I might want to capture some scope information.  This isn't
strictly parser data (i.e. data that is prefixed with "p_") because it
requires some knowledge of language semantics (beyond just the grammar).  In
such a case, I might capture the scope information by adding another line to
my action:

declaration
  : (qualifiers)* type:IDENT! var:IDENT! ";"!
    {
       ##->p_typeName = #type->getText();
       ##->p_varName = #var->getText();
       ##->m_scope = scope;
    }
  ;

...where scope is a member variable in my parser and it gets updated as a
descend through the tree.  Node the "m_" prefix.  This is an indication that
this is a relationship with some other node in my system (not necessarily a
parent or child either).  It is probably worth mentioning that if you were
to reverse the parsing process (i.e. output code based on the AST), the "p_"
information would have to be taken into account, but the "m_" data
shouldn't.  The "m_" data is only useful for transformations that require
semantic information.

OK, the pattern I noticed was that among the "m_" data, there are really
three kinds.  Some "m_" data is shorthand for stuff below my node in the
tree.  In other words, when I create the node I might do a search through my
child nodes and extract some useful information so that I have it "handy"
(i.e. I don't have to search the tree again).  Some information is related
to my context in the tree (i.e. information above this node in the AST).
Finally, some information is related to entirely different sections of the
tree.

Why is this classification interesting.  Well, when you call "dupTree", you
need to be able to update this information and what you do depends on the
kind of information it is.  By duplicating a node in the tree, you are
creating a copy and this affects each of these classes differently.  For
data that is referencing information below you, you need to redo the
original search that got you that data (since your children have been
duplicated too).  For data above you, you need to rely on whoever "duped"
you to fill that in (since nodes don't include any upward looking
references, at least by default).  Data that refers to other parts of the
tree is unaffected since its relationship to those nodes is not context
dependent so you can just carry it over.

I had already written my own "dupTree" method since ANTLR only provides this
method in association with factory objects (which I don't really like).  To
accommodate updating this information during a "dupTree", I created another
method on my nodes called "resetMetaData" (for lack of a better name) and
while clone() gets called as we work our way *down* during the dupTree,
resetMetaData() gets called on the way back *up* (so we can recover
information about child nodes).

Now I'm not disclosing all this because I think it is necessarily the best
way to do this (hell, I don't really know much about building
parsers/translators).  But, this provides one data point about how it might
be done by some users and this can be useful in working out the design of
any future features or enhancements.  Furthermore, it sounds like ANTLR3 is
going to try and abstract the tree building process even more and so it
seems useful to try and contribute these categorization ideas so they can be
reconciled with proposed abstractions.

OK, I just wanted to dump that before I forget it all and moved on to
something else.  Back to work...

--
Mike

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20030729/1e47aeec/attachment.html