[antlr-interest] [3.1.1] ANTLR3_MIN_TOKEN_TYPE define possibly incorrect
Sven Van Echelpoel
sven.van.echelpoel at empolis.com
Thu Mar 26 01:57:40 PDT 2009
[...]
> > At first I only checked that LT(1) returned something non-NULL, but
> > since a node was returned in (2) I ended up creating the wrong node
> > to
> > return in the rewrite rule. Then I found out that the type returned
> > was
> > ANTLR3_TOKEN_UP and that I could use ANTLR3_MIN_TOKEN_TYPE to
> > determine
> > whether I had a 'valid' node or not.
> >
> > Should LT(1) return a node in (2), or does that signal that
> > something's
> > amiss? If the behavior of LT(1) is correct, how can I determine that
> > a
> > node has no next sibling?
> >
> LT just returns the next token in the input stream and has no regard
> for rule boundaries etc.
>
> The input stream for an AST consists of a flat stream of tokens with
> the AST structure represented by the nodes ANTLR_TOKEN_DOWN and
> ANTLR_TOKEN_UP.
Ah, that's interesting to know.
>
> So, if LT(1) is ANTLR_TOKEN_UP, then it means there was no child
> token. However you can ask the node what the child count is too (see
> API). More impotantly though, this probably means that your body rule
> can be empty, in which case your grammar will have written the
> childless token without a DOWN and UP pair as a rewrite like this:
>
> : X -> ^(X)
>
> is the same as
>
> : X-> X
>
> So, you might find it easier to have:
>
> : ^( node=NEEDS_TO_BE_REWRITTEN b=body )
> -> { createNewNode( $node, $b, LT( 1 ) ) }
> | node=NEEDS_TO_BE_REWRITTEN
> ;
>
> Also though, if body cannot be empty, then LT(1) in this instance will
> ALWAYS be ANTLR_TOKEN_UP, matching the closing ) of the rule
> specification.
The body cannot be empty. The input I'm processing is something like
this:
A B C C ( D | E ) C ( C | C ) C
^ ^ ^
P P P
With A..E the types of elements and the | an alternation of two or more
choices. There's a ton more operators that can be applied to the
elements. Every C has to be transformed into a C' and inbetween two C's
that have no other operators applied to them comes another node P. Such
a P node can also be inserted between a C and an alternation of only
C's, on either or both sides of the alternation. Expressions can be
infinitely nested. The AST somewhat looks like this:
|
+- A
+- B
+- C
+- C
+- ALTERNATION
| +- GROUP
| | +- D
| +- GROUP
| +- E
+- C
+- ALTERNATION
| +- GROUP
| +- C
| +- GROUP
| +- C
+ C
Simplified, I have expressed this like so:
element_list :
element+
;
element :
A
| B
| C
| alternation
| // ... more elements and operators that look like
// ^( OPERATOR element_list )
;
alternation :
| ^( ALTERNATION group+ )
;
group :
^( GROUP element_list )
;
I have been unable to imagine a pattern that would single out those C's
that need a P inbetween. I would guess that the compiler would complain
about multiple alternatives for input... if I were to add special case
for two consecutive C's, right?
- Incidentally, when will the AST filter grammars, as discussed in
http://www.antlr.org/wiki/display/~admin/2008/10/23/tree+pattern
+matching+grammars, be available? Or are they already? -
So I settled for the very general grammar, which was easier to write and
thus maintain, and a fallback rewrite rule with the rewrite logic
offloaded to a C++ function. I used LT(1) to get to the next sibling.
Sven
More information about the antlr-interest
mailing list