[antlr-interest] C-style includes: problem with parser vs. lexer rules

Mon Aug 27 07:29:30 PDT 2007

On 8/27/07, Bjoern Doebel <doebel at tudos.org> wrote:
> Hi,
>
> I want to parse C-style #include statements and got a working version like
> this:
>
> fragment DIGIT  : '0'..'9';
> fragment CHAR : 'a'..'z' | 'A'..'Z';
>
> IMPORT : '#include' ;
> GT : '>' ;
> LT : '<' ;
> WORD : CHAR (CHAR|DIGIT|'_'|'-')*;
> WS     : (' '|'\t'|'\n'|'\r')+ { self.skip(); } ;
>
> filename : WORD ('/' WORD)* '.' WORD ;
>
> import_r : IMPORT LT filename GT ;
>
>
> This works, but now I'd like to transfer the filename rule into a lexer
> rule, so I get only one single token from it. Therefore, I change the last
> two rules:
>
> FNAME : WORD ('/' WORD)* '.' WORD ;
>
> import_r : IMPORT LT FNAME GT;
>
> But when I run it with e.g., "#include <foo/bar/baz.h>", I get an error:
> line 1:8 mismatched input 'foo/baz/bar.h' expecting FNAME
>
> What am I doing wrong and why does the lexer not recognize the filename as
> FNAME?
You probably don't want to move this into the lexer in this way, as it
will cause problems. For instance, input like "a.b" anywhere in the code
will be recognised as a filename, which likely isn't what you want.
You can either keep it in the parser or have the lexer handle the whole
include statement as a single token, like:
IMPORT : '#include' WS* LT FILENAME GT ;
GT : '>' ;
LT : '<' ;
WORD : CHAR (CHAR|DIGIT|'_'|'-')*;
WS     : (' '|'\t'|'\n'|'\r')+ { self.skip(); } ;
fragment
FILENAME : WORD ('/' WORD)* '.' WORD ;
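
As a rough sanity check outside ANTLR, the single-token idea can be
mirrored with a Python regular expression. This is only a sketch: the
names WORD, FILENAME, IMPORT and match_import are hypothetical stand-ins
for the lexer rules above, not ANTLR-generated code, and a regex engine
does not predict alternatives the way an ANTLR lexer does.

```python
import re

# Hypothetical regex mirrors of the lexer rules (assumption, not ANTLR output).
WORD = r"[A-Za-z][A-Za-z0-9_-]*"           # CHAR (CHAR|DIGIT|'_'|'-')*
FILENAME = rf"{WORD}(?:/{WORD})*\.{WORD}"  # WORD ('/' WORD)* '.' WORD
# '#include', optional whitespace, then '<' FILENAME '>' as one unit.
IMPORT = re.compile(rf"#include[ \t\r\n]*<({FILENAME})>")


def match_import(text):
    """Return the filename if text is exactly one include token, else None."""
    m = IMPORT.fullmatch(text)
    return m.group(1) if m else None
```

Because the filename pattern is only reachable from inside IMPORT, a
bare "a.b" elsewhere in the input is never treated as a filename here.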

Also, note that I think your current grammar will cause problems: as
soon as ANTLR sees a '.' or '/' following a word, it will assume it
must be a FILENAME. This is because ANTLR does not look past the end
of rules when predicting alternatives. So, for instance, "a/3" will
cause an error: upon seeing the '/', ANTLR will try to match a
filename, which will fail. To solve this you need to combine the WORD
and FILENAME rules like:
WORD: WORD_PART ( ('/' WORD_PART)* '.' WORD_PART)? ;

fragment
WORD_PART:CHAR (CHAR|DIGIT|'_'|'-')*;
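
To see which strings the combined rule accepts as one token, its shape
can be mirrored with a Python regular expression. This is a sketch under
assumptions: WORD_PART, WORD and is_single_word are hypothetical names,
and a backtracking regex engine behaves differently from ANTLR's lexer
prediction, so this only illustrates the language of the rule, not how
ANTLR will tokenize a given input.

```python
import re

# Hypothetical regex mirror of the combined rule (assumption, not ANTLR output).
WORD_PART = r"[A-Za-z][A-Za-z0-9_-]*"  # CHAR (CHAR|DIGIT|'_'|'-')*
# WORD_PART ( ('/' WORD_PART)* '.' WORD_PART )?
WORD = re.compile(rf"{WORD_PART}(?:(?:/{WORD_PART})*\.{WORD_PART})?")


def is_single_word(text):
    """True if the whole text fits the combined WORD rule as one token."""
    return WORD.fullmatch(text) is not None
```

Under this mirror, "foo" and "foo/bar/baz.h" each fit as one WORD, while
"a/3" and "foo/bar" do not, since the optional tail has to end in
'.' WORD_PART.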

Tom.
>
> Regards,
> Bjoern
>