[antlr-interest] Newbie Question
mzukowski at bco.com
mzukowski at bco.com
Wed Oct 10 06:49:31 PDT 2001
> -----Original Message-----
> From: Tiller, Michael (M.M.) [mailto:mtiller at ford.com]
> Sent: Friday, October 05, 2001 12:14 PM
> To: 'antlr-interest at yahoogroups.com'
> Subject: [antlr-interest] Newbie Question
>
>
> I'm still trying to understand how to make the best use of
> ANTLR. I'd like to describe a simple scenario (that I am
> currently faced with) and ask a few questions about it...
>
> As a side project, I would like to build a tool that will
> check some source code against some style guidelines that we
> have internally. Toward this end, I have developed a lexer
> and parser already that can parse the language that I am
> interested in. So far, so good.
Does your style guide include whitespace, formatting and comment structure?
These are relatively hard problems.
> At this point, it seems to me that I have the pieces for
> turning the characters into tokens and then tokens into
> productions. Now what I want to do is walk the tree that
> gets generated by the parser.
>
> It seems ugly to actually add my style checking guidelines to
> the grammar because I might want to use the grammar at some
> point for some other applications. My understanding is that
> I should probably write a tree walker that is specific to my
> current application and leave the grammar alone. This
> protects for future applications. Is my understanding
> correct in this regard?
Yes.
> So, I am currently (perhaps misguidedly) looking at building
> a tree walker. But this brings up what seems like an
> important issue with the grammar. What kinds of tree
> construction markup should I do to the grammar to get
> something that will be easy to walk? In most of the markups
> of grammars, I see lines like:
>
> declaration
> : type ^IDENT SEMI!
> ;
>
> The tree node will now have an IDENT token as the root of the
> declaration. This seems like it would make finding a
> declaration very difficult because your tree would be something like:
>
> #(IDENT <type info>)
>
> Is my understanding of this correct?
Yes.
> It seems like a much better way would be to provide tree node
> types for the productions so it would be more like:
>
> #(DECLARATION type name)
>
> Am I correct in assuming that ANTLR doesn't do something like
> this because the tree would get really cluttered with
> production names as well as tokens? So, it would seem you
> might want to selectively choose the most important
> productions to use for nodes. In this case, should protected
> tokens be introduced in the lexer for this purpose?
Yes, preserving the structure of the parse would give you a parse tree that
is tedious to use, especially for expressions.
> I'm
> guessing that this is why the 'GNU C' grammar in the
> resources section of the ANTLR web page has things like this:
> declaration
> : ds:declSpecifiers (initDeclList[ds1])? ( SEMI )+
> { ## = #( #[NDeclaration], ##); }
> ;
>
> Am I correct in assuming that this attempting to create a
> node with a root of 'NDeclaration' so that this is easier to
> identify later on?
Exactly!
> For my application, it seems quite reasonable to identify
> certain key structural pieces and do things like the above.
> If I start doing this, I would like to protect for the
> possibility of having both Java and C++ tree walkers. Is it,
> in general, possible to build a parser grammar with markups
> and so on without having to write any target language
> specific actions? I assume this would be the goal if
> somebody were trying to develop a language neutral grammar?
When I wrote my translator from AREV R/BASIC to Visual BASIC I made a master
tree grammar that encompassed both languages so I could represent the
intermediate stages of translation. You just have to make sure that the
tokens used by each parser don't overlap.
> From a software engineering perspective, it seems somewhat
> strange that the treewalker mimics the structure of the
> grammar so heavily. It seems like this is pretty redundant.
> The tree walkers don't follow the exact same structure as the
> grammar, but I'm trying to understand the relationship
> between the two structures and what the implications are if
> the grammar structure is changed. It seems like it would be
> pretty undesirable to have changes in the grammar creating
> lots of problems in a tree walker if the grammar change was
> only a minor syntactical modification. How do you avoid
> these kinds of issues?
My GNU C grammar source is actually a noweb file which I've structured so
that similar rules from the parser and tree walkers are all close to each
other. That way when I change a grammar rule it's right in my face to
change the tree grammars too. In general the parser grammar (and hence the
tree grammar) usually comes to a fixed state once you've debugged it. Then
the interesting work is in specializing the actions of your tree grammar,
for which antlr subclassing is useful.
> Finally, I haven't had enough experience playing around with
> the tree walkers, but I recall that there is some mention in
> the documentation about the matching rules being different
> for tree walkers. I can go back and read the rules again,
> but I have a bigger question than how the rules work. What I
> want to understand are the implications. It seems as though
> the tree walking is designed so that you don't have to mimic
> the grammar structure but instead you can just pluck patterns
> you are interested in out of the tree generated by the
> parser. Do these matching rules for tree parsers essentially
> mean that you can get away with this (i.e. only writing rules
> for particular patterns you are interested in)? Does this,
> to some extent, mitigate the problems of grammar changes
> causing lots of tearup in the tree walker?
Generally you structure your trees so they are easy to walk. The degenerate
case would be a tree which is nothing but a flat sibling list of the tokens
as they were parsed. Then your tree grammar is your parser and you haven't
gained anything by creating a tree. Adding the parent/child relationships
gives structure to your tree. Certain things can be encoded just in the
structure of the tree, so parens and semicolons can be left out.
I typically have a complete tree grammar empty of actions that I then
subclass and override just the actions I'm interested in to do some
transformations.
> I guess my overall impression (although I haven't had enough
> time to actually try much of this out so far) is that it
> would be important to markup the grammar so that the
> resulting tree distills out much of the tiny syntax minutiae
> and instead try and capture (in the tree being built) the
> fundamental structure. In other words, a declaration
> consists of a name, type information and possibly array size
> but you would avoid recording things like the braces or the
> semicolon in the tree. If you assume that the fundamental
> structure (i.e. all declarations have a type and name) isn't
> likely to change then changes in the grammar wouldn't really
> propagate to the tree walker.
>
> Is (any of) this correct? :-)
Yep, you've got the basic concepts right.
Monty
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20011010/b423e011/attachment.html
More information about the antlr-interest
mailing list