[antlr-interest] Newbie Question

Wed Oct 10 06:49:31 PDT 2001

> -----Original Message-----
> From: Tiller, Michael (M.M.) [mailto:mtiller at ford.com]
> Sent: Friday, October 05, 2001 12:14 PM
> To: 'antlr-interest at yahoogroups.com'
> Subject: [antlr-interest] Newbie Question
> 
> 
> I'm still trying to understand how to make the best use of 
> ANTLR.  I'd like to describe a simple scenario (that I am 
> currently faced with) and ask a few questions about it...
>  
> As a side project, I would like to build a tool that will 
> check some source code against some style guidelines that we 
> have internally.  Toward this end, I have developed a lexer 
> and parser already that can parse the language that I am 
> interested in.  So far, so good.

Does your style guide include whitespace, formatting and comment structure?
These are relatively hard problems.

> At this point, it seems to me that I have the pieces for 
> turning the characters into tokens and then tokens into 
> productions.  Now what I want to do is walk the tree that 
> gets generated by the parser.
>  
> It seems ugly to actually add my style checking guidelines to 
> the grammar because I might want to use the grammar at some 
> point for some other applications.  My understanding is that 
> I should probably write a tree walker that is specific to my 
> current application and leave the grammar alone.  This 
> protects for future applications.  Is my understanding 
> correct in this regard?

Yes.  

> So, I am currently (perhaps misguidedly) looking at building 
> a tree walker.  But this brings up what seems like an 
> important issue with the grammar.  What kinds of tree 
> construction markup should I do to the grammar to get 
> something that will be easy to walk?  In most of the markups 
> of grammars, I see lines like:
>  
> declaration
>   : type ^IDENT SEMI!
>   ;
> 
> The tree node will now have an IDENT token as the root of the 
> declaration.  This seems like it would make finding a 
> declaration very difficult because your tree would be something like:
>  
> #(IDENT <type info>)
>  
> Is my understanding of this correct?

Yes.

> It seems like a much better way would be to provide tree node 
> types for the productions so it would be more like:
>  
> #(DECLARATION type name)
>  
> Am I correct in assuming that ANTLR doesn't do something like 
> this because the tree would get really cluttered with 
> production names as well as tokens?  So, it would seem you 
> might want to selectively choose the most important 
> productions to use for nodes.  In this case, should protected 
> tokens be introduced in the lexer for this purpose?

Yes, preserving the structure of the parse would give you a parse tree that
is tedious to use, especially for expressions.

> I'm 
> guessing that this is why the 'GNU C' grammar in the 
> resources section of the ANTLR web page has things like this:
>  declaration
>         : ds:declSpecifiers (initDeclList[ds1])? ( SEMI )+
>               { ## = #( #[NDeclaration], ##); }
>         ;
>  
> Am I correct in assuming that this attempting to create a 
> node with a root of 'NDeclaration' so that this is easier to 
> identify later on?

Exactly!

> For my application, it seems quite reasonable to identify 
> certain key structural pieces and do things like the above.  
> If I start doing this, I would like to protect for the 
> possibility of having both Java and C++ tree walkers.  Is it, 
> in general, possible to build a parser grammar with markups 
> and so on without having to write any target language 
> specific actions?  I assume this would be the goal if 
> somebody were trying to develop a language neutral grammar?

When I wrote my translator from AREV R/BASIC to Visual BASIC I made a master
tree grammar that encompassed both languages so I could represent the
intermediate stages of translation.  You just have to make sure that the
tokens used by each parser don't overlap.

> From a software engineering perspective, it seems somewhat 
> strange that the treewalker mimics the structure of the 
> grammar so heavily.  It seems like this is pretty redundant.  
> The tree walkers don't follow the exact same structure as the 
> grammar, but I'm trying to understand the relationship 
> between the two structures and what the implications are if 
> the grammar structure is changed.  It seems like it would be 
> pretty undesirable to have changes in the grammar creating 
> lots of problems in a tree walker if the grammar change was 
> only a minor syntactical modification.  How do you avoid 
> these kinds of issues?

My GNU C grammar source is actually a noweb file which I've structured so
that similar rules from the parser and tree walkers are all close to each
other.  That way when I change a grammar rule it's right in my face to
change the tree grammars too.  In general the parser grammar (and hence the
tree grammar) usually comes to a fixed state once you've debugged it.  Then
the interesting work is in specializing the actions of your tree grammar,
for which antlr subclassing is useful.

> Finally, I haven't had enough experience playing around with 
> the tree walkers, but I recall that there is some mention in 
> the documentation about the matching rules being different 
> for tree walkers.  I can go back and read the rules again, 
> but I have a bigger question than how the rules work.  What I 
> want to understand are the implications.  It seems as though 
> the tree walking is designed so that you don't have to mimic 
> the grammar structure but instead you can just pluck patterns 
> you are interested in out of the tree generated by the 
> parser.  Do these matching rules for tree parsers essentially 
> mean that you can get away with this (i.e. only writing rules 
> for particular patterns you are interested in)?  Does this, 
> to some extent, mitigate the problems of grammar changes 
> causing lots of tearup in the tree walker?

Generally you structure your trees so they are easy to walk.  The degenerate
case would be a tree which is nothing but a flat sibling list of the tokens
as they were parsed.  Then your tree grammar is your parser and you haven't
gained anything by creating a tree.  Adding the parent/child relationships
gives structure to your tree.  Certain things can be encoded just in the
structure of the tree, so parens and semicolons can be left out.

I typically have a complete tree grammar empty of actions that I then
subclass and override just the actions I'm interested in to do some
transformations.

> I guess my overall impression (although I haven't had enough 
> time to actually try much of this out so far) is that it 
> would be important to markup the grammar so that the 
> resulting tree distills out much of the tiny syntax minutiae 
> and instead try and capture (in the tree being built) the 
> fundamental structure.  In other words, a declaration 
> consists of a name, type information and possibly array size 
> but you would avoid recording things like the braces or the 
> semicolon in the tree.  If you assume that the fundamental 
> structure (i.e. all declarations have a type and name) isn't 
> likely to change then changes in the grammar wouldn't really 
> propagate to the tree walker.
>  
> Is (any of) this correct? :-)

Yep, you've got the basic concepts right.

Monty 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20011010/b423e011/attachment.html