[antlr-interest] Lexer not pulling in fragments?

John B. Brodie jbb at acm.org
Thu Apr 2 08:19:02 PDT 2009


On Thursday 02 April 2009 10:27:27 am Jim Idle wrote:
> Joseph Klumpp wrote:
> > I'm trying to create tokens for the guards of C header files (with
> > filter=true), e.g. '#define __hello_h_' => <GUARD, #define
> > __hello_h_>, and have the following rules defined:
> >
> > GUARD	:	'#' LETTER+ WS+ IDPART '_';
> > ID	:	IDPART;
> >
> > WS	: 	(' ' | '\n')+	{$channel = HIDDEN;};
> >
> > fragment
> > IDPART	:	LETTER ( LETTER | DIGIT )*;
> >
> > fragment
> > LETTER
> > 	:	'$'
> > 	|	'\u0041'..'\u005a'
> > 	|	'\u0061'..'\u007a'
> > 	|	'_'
> > 	;
> >
> > fragment
> > DIGIT	: 	'0'..'9';
> >
> > Using these rules GUARD will never appear in the token stream. If I
> > change it to:
> > GUARD	:	'#' LETTER+ WS+ LETTER (LETTER | DIGIT)* '_';
> > the rule lexes correctly. I have two questions:
> > 1. Why does it not lex correctly when I lex with IDPART?
>
> You have WS+, but the WS rule is already a +; you just need WS. This is
> probably screwing with the analysis in some way. You should be getting a
> warning about this though, are you not?

And you should probably make WS a fragment, so as to avoid the overhead of
creating a token (even though that overhead is small):

WS_ignored : WS	  {$channel = HIDDEN;};
fragment WS : (' ' | '\n')+ ;

> > 2. Is there a way to set the value of token GUARD to be just the
> > IDPART portion of the lexeme?
>
> GUARD	:	'#' LETTER+ WS idp=IDPART '_'
> 			{ $text = $idp.text; } // Should work
(but note that $idp.text will not contain the trailing '_')
>         ;
>
I think your problem with using the IDPART fragment in the GUARD rule is the
trailing '_' that GUARD requires after the IDPART.

Since IDPART itself permits a trailing '_' (LETTER includes '_'), ANTLR would
have to generate two flavors of the IDPART fragment: one for ID, in which the
final '_' is included, and another for GUARD, in which the final '_' is not
included.  I do not think ANTLR can do this.  By hoisting the fragment's body
into the GUARD rule, you allowed ANTLR to see that the final '_' is not part
of the IDPART but part of the GUARD syntax.
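
To make the point above concrete, here is a small Python sketch. It is only a toy model of a greedy matcher without backtracking, not ANTLR's actual analysis or generated code, and the function names are made up for illustration. It compares matching IDPART as a self-contained unit followed by '_' against an inlined loop that peeks one character ahead before consuming a '_'.

```python
import string

# Character classes mirroring the LETTER and DIGIT fragments in the grammar.
LETTERS = set(string.ascii_letters) | {"$", "_"}
DIGITS = set(string.digits)


def match_idpart(text, pos):
    """Greedy IDPART: LETTER (LETTER | DIGIT)*; returns end position or -1."""
    if pos >= len(text) or text[pos] not in LETTERS:
        return -1
    pos += 1
    while pos < len(text) and (text[pos] in LETTERS or text[pos] in DIGITS):
        pos += 1
    return pos


def match_guard_tail_with_fragment(text):
    """IDPART '_' where IDPART is matched on its own, with no backtracking."""
    end = match_idpart(text, 0)
    if end == -1:
        return False
    # IDPART has already consumed every '_', so the literal '_' cannot match.
    return end < len(text) and text[end] == "_"


def match_guard_tail_inlined(text):
    """LETTER (LETTER | DIGIT)* '_' with the loop exit decided by lookahead."""
    if not text or text[0] not in LETTERS:
        return False
    pos = 1
    while pos < len(text):
        ch = text[pos]
        nxt = text[pos + 1] if pos + 1 < len(text) else ""
        # Leave the loop when '_' is not followed by another identifier char.
        if ch == "_" and not (nxt in LETTERS or nxt in DIGITS):
            break
        if not (ch in LETTERS or ch in DIGITS):
            break
        pos += 1
    return pos < len(text) and text[pos] == "_"
```

With the input "__hello_h_", the fragment-style matcher fails because nothing is left for the final '_', while the inlined one succeeds. Both reject "foo", which has no trailing '_' at all.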

And, by the way (my knowledge of C is old), isn't the trailing '_' in a GUARD
just a coding convention and not a syntactic constraint? E.g., isn't
"#define foo" a valid guard?


