[antlr-interest] [3.1.1] ANTLR3_MIN_TOKEN_TYPE define possibly incorrect

Thu Mar 26 01:57:40 PDT 2009

[...]
> > At first I only checked that LT(1) returned something non-NULL, but
> > since a node was returned in (2) I ended up creating the wrong node
> > to
> > return in the rewrite rule. Then I found out that the type returned
> > was
> > ANTLR3_TOKEN_UP and that I could use ANTLR3_MIN_TOKEN_TYPE to
> > determine
> > whether I had a 'valid' node or not.
> > 
> > Should LT(1) return a node in (2), or does that signal that
> > something's
> > amiss? If the behavior of LT(1) is correct, how can I determine that
> > a
> > node has no next sibling?
> >   
> LT just returns the next token in the input stream and has no regard
> for rule boundaries etc. 
> 
> The input stream for an AST consists of a flat stream of tokens with
> the AST structure represented by the nodes ANTLR_TOKEN_DOWN and
> ANTLR_TOKEN_UP.
Ah, that's interesting to know.
> 
> So, if LT(1) is ANTLR_TOKEN_UP, then it means there was no child
> token. However you can ask the node what the child count is too (see
> API). More impotantly though, this probably means that your body rule
> can be empty, in which case your grammar will have written the
> childless token without a DOWN and UP pair as a rewrite like this:
> 
> : X -> ^(X)
> 
> is the same as 
> 
> : X-> X
> 
> So, you might find it easier to have:
> 
> :  ^( node=NEEDS_TO_BE_REWRITTEN b=body )
>      -> { createNewNode( $node, $b, LT( 1 ) ) }
> | node=NEEDS_TO_BE_REWRITTEN
> ;
> 
> Also though, if body cannot be empty, then LT(1) in this instance will
> ALWAYS be ANTLR_TOKEN_UP, matching the closing ) of the rule
> specification.

The body cannot be empty. The input I'm processing is something like
this:

A B C C ( D | E ) C ( C | C ) C
     ^             ^         ^ 
     P             P         P

With A..E the types of elements and the | an alternation of two or more
choices. There's a ton more operators that can be applied to the
elements. Every C has to be transformed into a C' and inbetween two C's
that have no other operators applied to them comes another node P. Such
a P node can also be inserted between a C and an alternation of only
C's, on either or both sides of the alternation. Expressions can be
infinitely nested. The AST somewhat looks like this:

|
+- A
+- B
+- C
+- C
+- ALTERNATION
|  +- GROUP
|  |  +- D
|  +- GROUP
|     +- E
+- C
+- ALTERNATION
|  +- GROUP
|     +- C
|  +- GROUP
|     +- C
+ C

Simplified, I have expressed this like so:

element_list : 
  element+ 
;

element : 
  A
| B
| C
| alternation
|  // ... more elements and operators that look like 
   // ^( OPERATOR element_list )
;

alternation :
| ^( ALTERNATION group+ )
;

group :
  ^( GROUP element_list )
;

I have been unable to imagine a pattern that would single out those C's
that need a P inbetween. I would guess that the compiler would complain
about multiple alternatives for input... if I were to add special case
for two consecutive C's, right? 

- Incidentally, when will the AST filter grammars, as discussed in
http://www.antlr.org/wiki/display/~admin/2008/10/23/tree+pattern
+matching+grammars, be available? Or are they already? -

So I settled for the very general grammar, which was easier to write and
thus maintain, and a fallback rewrite rule with the rewrite logic
offloaded to a C++ function. I used LT(1) to get to the next sibling.

Sven