[antlr-interest] Lexer not pulling in fragments?
John B. Brodie
jbb at acm.org
Thu Apr 2 08:19:02 PDT 2009
On Thursday 02 April 2009 10:27:27 am Jim Idle wrote:
> Joseph Klumpp wrote:
> > I'm trying to create tokens for the guards of C header files (with
> > filter=true), e.g. '#define __hello_h_' => <GUARD, #define
> > __hello_h_>, and have the following rules defined:
> >
> > GUARD : '#' LETTER+ WS+ IDPART '_';
> > ID : IDPART;
> >
> > WS : (' ' | '\n')+ {$channel = HIDDEN;};
> >
> > fragment
> > IDPART : LETTER ( LETTER | DIGIT )*;
> >
> > fragment
> > LETTER
> >     : '$'
> >     | '\u0041'..'\u005a'
> >     | '\u0061'..'\u007a'
> >     | '_'
> >     ;
> >
> > fragment
> > DIGIT : '0'..'9';
> >
> > Using these rules GUARD will never appear in the token stream. If I
> > change it to:
> > GUARD : '#' LETTER+ WS+ LETTER (LETTER | DIGIT)* '_';
> > the rule lexes correctly. I have two questions:
> > 1. Why does it not lex correctly when I lex with IDPART?
>
> You have WS+, but the WS rule is already a +, so you just need WS. This is
> probably screwing with the analysis in some way. You should be getting a
> warning about this though, are you not?
And you should probably make WS a fragment, so as to avoid the overhead of
creating a token (even though that overhead is small):
WS_ignored : WS {$channel = HIDDEN;};
fragment WS : (' ' | '\n')+ ;
> > 2. Is there a way to set the value of token GUARD to be just the
> > IDPART portion of the lexem?
>
> GUARD : '#' LETTER+ WS idp=IDPART '_'
> { $text = $idp.text; } // Should work
(but note that $idp.text will not contain the trailing '_')
> ;
>
I think your problem with using the IDPART fragment in the GUARD rule is the
trailing '_' required after IDPART in your GUARD rule.
Since IDPART itself permits a trailing '_', ANTLR would have to generate two
flavors of the IDPART fragment: one for ID, in which the final '_' is
included, and another for GUARD, in which the final '_' is not included. I do
not think ANTLR can do this. By hoisting the fragment's body into the GUARD
rule, you allowed ANTLR to see that the final '_' was not part of the IDPART
but part of the GUARD syntax.
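
To make the point above concrete, here is a small Python model of the two
situations. It is only a sketch of the behavior I am describing, not the code
ANTLR generates, and every function name in it is made up for illustration.
It assumes the fragment acts like a subroutine that consumes greedily and
never gives characters back, while the hoisted version can peek one character
ahead before swallowing a '_'.

```python
import string
from typing import Optional

# Hypothetical stand-ins for the LETTER and DIGIT fragments in the grammar.
LETTERS = set(string.ascii_letters) | {"$", "_"}
ID_CHARS = LETTERS | set(string.digits)


def match_prefix(text: str) -> int:
    """Match '#' LETTER+ WS; return the index after the whitespace, or -1."""
    if not text.startswith("#"):
        return -1
    pos = 1
    while pos < len(text) and text[pos] in LETTERS:
        pos += 1
    if pos == 1:
        return -1  # no LETTER after '#'
    ws_start = pos
    while pos < len(text) and text[pos] in " \n":
        pos += 1
    return pos if pos > ws_start else -1


def match_idpart(text: str, pos: int) -> int:
    """Greedy IDPART : LETTER (LETTER | DIGIT)* ; return end index, or -1."""
    if pos >= len(text) or text[pos] not in LETTERS:
        return -1
    pos += 1
    while pos < len(text) and text[pos] in ID_CHARS:
        pos += 1  # '_' is a LETTER, so every trailing '_' is swallowed here
    return pos


def match_guard_with_fragment(text: str) -> Optional[str]:
    """GUARD : '#' LETTER+ WS IDPART '_' with IDPART called as a subroutine."""
    pos = match_prefix(text)
    if pos < 0:
        return None
    end = match_idpart(text, pos)
    if end < 0:
        return None
    # IDPART stopped at a non-identifier character, so this can never be '_'.
    if end < len(text) and text[end] == "_":
        return text[pos:end]
    return None


def match_guard_inlined(text: str) -> Optional[str]:
    """Same rule with the IDPART body hoisted into GUARD (one-char peek)."""
    pos = match_prefix(text)
    if pos < 0 or pos >= len(text) or text[pos] not in LETTERS:
        return None
    start = pos
    pos += 1
    while pos < len(text) and text[pos] in ID_CHARS:
        nxt = text[pos + 1] if pos + 1 < len(text) else ""
        if text[pos] == "_" and nxt not in ID_CHARS:
            break  # leave the final '_' for the literal in GUARD
        pos += 1
    if pos < len(text) and text[pos] == "_":
        return text[start:pos]  # identifier without the trailing '_'
    return None
```

In this model match_guard_with_fragment can never succeed, because the greedy
loop has already eaten the '_' that GUARD wants next, whereas
match_guard_inlined returns the identifier without its final '_'. That is
also why $idp.text would not contain the trailing '_'.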
And, by the way (my knowledge of C is old), isn't the trailing '_' in a guard
just a coding convention and not a syntactic constraint? E.g., isn't
"#define foo" a valid guard?