[antlr-interest] Newbie Question

Fri Oct 5 12:13:41 PDT 2001

I'm still trying to understand how to make the best use of ANTLR.  I'd like to describe a simple scenario (that I am currently faced with) and ask a few questions about it...

As a side project, I would like to build a tool that will check some source code against some style guidelines that we have internally.  Toward this end, I have developed a lexer and parser already that can parse the language that I am interested in.  So far, so good.

At this point, it seems to me that I have the pieces for turning the characters into tokens and then tokens into productions.  Now what I want to do is walk the tree that gets generated by the parser.

It seems ugly to actually add my style checking guidelines to the grammar because I might want to use the grammar at some point for some other applications.  My understanding is that I should probably write a tree walker that is specific to my current application and leave the grammar alone.  That keeps the grammar reusable for future applications.  Is my understanding correct in this regard?

So, I am currently (perhaps misguidedly) looking at building a tree walker.  But this brings up what seems like an important issue with the grammar.  What kind of tree construction markup should I add to the grammar to get a tree that will be easy to walk?  In most marked-up grammars, I see lines like:

declaration
  : type IDENT^ SEMI!
  ;

The tree node will now have an IDENT token as the root of the declaration.  This seems like it would make finding a declaration very difficult because your tree would be something like:

#(IDENT <type info>)

Is my understanding of this correct?

It seems like a much better way would be to provide tree node types for the productions so it would be more like:

#(DECLARATION type name)

Am I correct in assuming that ANTLR doesn't do something like this because the tree would get really cluttered with production names as well as tokens?  So, it would seem you might want to selectively choose the most important productions to use for nodes.  In this case, should protected tokens be introduced in the lexer for this purpose?  I'm guessing that this is why the 'GNU C' grammar in the resources section of the ANTLR web page has things like this:

declaration
        : ds:declSpecifiers (initDeclList[ds1])? ( SEMI )+
              { ## = #( #[NDeclaration], ##); }
        ;

Am I correct in assuming that this is attempting to create a node with a root of 'NDeclaration' so that a declaration is easier to identify later on?
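
To make concrete what I mean by "easier to identify", here is a small Java sketch.  It does not use ANTLR's AST classes at all: Node, countType and declaredNames are names I made up for illustration, and the two sample trees are my hand-built guesses at what each markup style would produce for a declaration of 'count' followed by an assignment to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeclarationRootSketch {
    // Hypothetical stand-in for an AST node (not ANTLR's AST interface).
    static class Node {
        final String type;          // token type name, e.g. "IDENT"
        final String text;          // token text, e.g. "count"
        final List<Node> children;

        Node(String type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Count every node of the given type anywhere in the tree.
    static int countType(Node n, String type) {
        int total = n.type.equals(type) ? 1 : 0;
        for (Node c : n.children) {
            total += countType(c, type);
        }
        return total;
    }

    // Collect the name child of every DECLARATION-rooted subtree.
    static List<String> declaredNames(Node n) {
        List<String> names = new ArrayList<String>();
        collect(n, names);
        return names;
    }

    private static void collect(Node n, List<String> names) {
        if (n.type.equals("DECLARATION") && n.children.size() >= 2) {
            names.add(n.children.get(1).text);  // children are: type, name
        }
        for (Node c : n.children) {
            collect(c, names);
        }
    }

    // An assignment "count = 1" that uses the declared name.
    private static Node usage() {
        return new Node("ASSIGN", "=",
                new Node("IDENT", "count"), new Node("NUM", "1"));
    }

    // Declaration rooted at IDENT: #(BLOCK #(IDENT int) #(ASSIGN count 1))
    static Node identRooted() {
        return new Node("BLOCK", "",
                new Node("IDENT", "count", new Node("TYPE", "int")),
                usage());
    }

    // Declaration rooted at an imaginary node:
    // #(BLOCK #(DECLARATION int count) #(ASSIGN count 1))
    static Node declarationRooted() {
        return new Node("BLOCK", "",
                new Node("DECLARATION", "",
                        new Node("TYPE", "int"), new Node("IDENT", "count")),
                usage());
    }

    public static void main(String[] args) {
        // IDENT matches both the declaration and the later use of the name.
        System.out.println(countType(identRooted(), "IDENT"));
        // DECLARATION matches only the declaration itself.
        System.out.println(countType(declarationRooted(), "DECLARATION"));
        System.out.println(declaredNames(declarationRooted()));
    }
}
```

In the IDENT-rooted tree, searching for IDENT finds the declaration and the later use of the name alike, so the walker would have to look at the children to tell them apart.  With a dedicated root, the node type alone is enough.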

For my application, it seems quite reasonable to identify certain key structural pieces and do things like the above.  If I start doing this, I would like to allow for the possibility of having both Java and C++ tree walkers.  Is it, in general, possible to build a parser grammar with tree construction markup and so on without having to write any target-language-specific actions?  I assume this would be the goal if somebody were trying to develop a language-neutral grammar?

From a software engineering perspective, it seems somewhat strange that the tree walker mimics the structure of the grammar so heavily.  It seems like this is pretty redundant.  The tree walkers don't follow the exact same structure as the grammar, but I'm trying to understand the relationship between the two structures and what the implications are if the grammar structure is changed.  It seems like it would be pretty undesirable to have changes in the grammar creating lots of problems in a tree walker if the grammar change was only a minor syntactical modification.  How do you avoid these kinds of issues?

Finally, I haven't had enough experience playing around with the tree walkers, but I recall that there is some mention in the documentation about the matching rules being different for tree walkers.  I can go back and read the rules again, but I have a bigger question than how the rules work.  What I want to understand are the implications.  It seems as though the tree walking is designed so that you don't have to mimic the grammar structure but instead you can just pluck patterns you are interested in out of the tree generated by the parser.  Do these matching rules for tree parsers essentially mean that you can get away with this (i.e. only writing rules for particular patterns you are interested in)?  Does this, to some extent, mitigate the problems of grammar changes causing lots of tearup in the tree walker?
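
Here is a sketch of the kind of "pattern plucking" I have in mind, written as a hand-rolled recursive traversal in plain Java.  This is not ANTLR's tree-parser matching (whether an ANTLR tree grammar can be written this loosely is exactly my question).  The Node class, the SECTION node type and the style rule (declared names must start with a lower-case letter) are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PatternPluckSketch {
    // Hypothetical stand-in for an AST node (not ANTLR's AST interface).
    static class Node {
        final String type;
        final String text;
        final List<Node> children;

        Node(String type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Made-up style rule: a declared name must start with a lower-case letter.
    static List<String> badNames(Node root) {
        List<String> bad = new ArrayList<String>();
        visit(root, bad);
        return bad;
    }

    // Visit every node, but react only to the one pattern of interest.
    private static void visit(Node n, List<String> bad) {
        if (n.type.equals("DECLARATION") && n.children.size() >= 2) {
            String name = n.children.get(1).text;  // children are: type, name
            if (!name.isEmpty() && !Character.isLowerCase(name.charAt(0))) {
                bad.add(name);
            }
        }
        for (Node c : n.children) {
            visit(c, bad);
        }
    }

    static Node decl(String type, String name) {
        return new Node("DECLARATION", "",
                new Node("TYPE", type), new Node("IDENT", name));
    }

    // Both declarations sit directly under the top-level node.
    static Node flat() {
        return new Node("UNIT", "", decl("int", "Total"), decl("int", "count"));
    }

    // Same declarations, but one is pushed down under an extra node type,
    // as if the grammar had grown a new enclosing construct.
    static Node nested() {
        return new Node("UNIT", "",
                new Node("SECTION", "", decl("int", "Total")),
                decl("int", "count"));
    }

    public static void main(String[] args) {
        System.out.println(badNames(flat()));
        System.out.println(badNames(nested()));
    }
}
```

Because the traversal never names UNIT or SECTION, it reports the same offending name for both trees.  That is the insensitivity to grammar changes I am hoping a tree walker can have.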

I guess my overall impression (although I haven't had enough time to actually try much of this out so far) is that it is important to mark up the grammar so that the resulting tree leaves out the syntactic minutiae and instead captures the fundamental structure.  In other words, a declaration consists of a name, type information and possibly an array size, but you would avoid recording things like the braces or the semicolon in the tree.  If you assume that the fundamental structure (i.e. all declarations have a type and a name) isn't likely to change, then changes in the grammar wouldn't really propagate to the tree walker.
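
As a toy illustration of that last point, the Java sketch below builds the same abstract tree from two different surface syntaxes.  Everything in it is hypothetical: the Node class, the two hand-written "parsers" (which just pick tokens by position instead of really parsing), and the colon-style syntax, which I invented only to stand in for a grammar change.

```java
import java.util.Arrays;
import java.util.List;

class AbstractTreeSketch {
    // Hypothetical stand-in for an AST node (not ANTLR's AST interface).
    static class Node {
        final String type;
        final String text;
        final List<Node> children;

        Node(String type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Original syntax:  int count ;
    // The semicolon is consumed but never stored in the tree.
    static Node fromCStyle(List<String> tokens) {
        return new Node("DECLARATION", "",
                new Node("TYPE", tokens.get(0)),
                new Node("IDENT", tokens.get(1)));
    }

    // Invented variant syntax:  count : int ;
    // Different token order and punctuation, same abstract tree.
    static Node fromColonStyle(List<String> tokens) {
        return new Node("DECLARATION", "",
                new Node("TYPE", tokens.get(2)),
                new Node("IDENT", tokens.get(0)));
    }

    // Render a tree in #(ROOT child child) notation; leaves print their text.
    static String show(Node n) {
        if (n.children.isEmpty()) {
            return n.text;
        }
        StringBuilder sb = new StringBuilder("#(").append(n.type);
        for (Node c : n.children) {
            sb.append(' ').append(show(c));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(show(fromCStyle(Arrays.asList("int", "count", ";"))));
        System.out.println(show(fromColonStyle(Arrays.asList("count", ":", "int", ";"))));
    }
}
```

Both calls print #(DECLARATION int count), so a walker written against that tree shape would not notice the change in surface syntax.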

Is (any of) this correct? :-)

Thanks

--
Mike
