[antlr-interest] C-style includes: problem with parser vs. lexer rules

Thomas Brandon tbrandonau at gmail.com
Mon Aug 27 07:29:30 PDT 2007

On 8/27/07, Bjoern Doebel <doebel at tudos.org> wrote:
> Hi,
> I want to parse C-style #include statements and got a working version like
> this:
> fragment DIGIT  : '0'..'9';
> fragment CHAR : 'a'..'z' | 'A'..'Z';
> IMPORT : '#include' ;
> GT : '>' ;
> LT : '<' ;
> WORD : CHAR (CHAR|DIGIT|'_'|'-')*;
> WS     : (' '|'\t'|'\n'|'\r')+ { self.skip(); } ;
> filename : WORD ('/' WORD)* '.' WORD ;
> import_r : IMPORT LT filename GT ;
> This works, but now I'd like to transfer the filename rule into a lexer
> rule, so I get only one single token from it. Therefore, I change the last
> two rules:
> FNAME : WORD ('/' WORD)* '.' WORD ;
> import_r : IMPORT LT FNAME GT;
> But when I run it with e.g., "#include <foo/bar/baz.h>", I get an error:
> line 1:8 mismatched input 'foo/baz/bar.h' expecting FNAME
> What am I doing wrong and why does the lexer not recognize the filename as
You probably don't want to move this into the lexer in this way as it
will cause issues. For instance input like "a.b" in any code will be
recognised as a filename which likely isn't what you want.
You can either keep it in the parser or have the whole include
statement handled as a single token in the lexer, like:
IMPORT : '#include' WS* LT FILENAME GT ;
GT : '>' ;
LT : '<' ;
WORD : CHAR (CHAR|DIGIT|'_'|'-')*;
WS     : (' '|'\t'|'\n'|'\r')+ { self.skip(); } ;
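
To see the effect of treating the whole include statement as one token, here is a rough Python sketch (a regex, not ANTLR syntax). The names INCLUDE and match_include are made up for illustration, and the filename shape is assumed to be the one from your FNAME rule:

```python
import re

# Same shape as WORD in the grammar: a letter, then letters, digits, '_' or '-'.
WORD = r"[A-Za-z][A-Za-z0-9_-]*"

# Hypothetical single-token pattern: '#include', optional whitespace,
# then '<', a filename (WORD ('/' WORD)* '.' WORD), then '>'.
INCLUDE = re.compile(
    rf"#include[ \t\r\n]*<(?P<filename>{WORD}(?:/{WORD})*\.{WORD})>"
)

def match_include(text):
    """Return the filename of an include statement at the start of text, else None."""
    m = INCLUDE.match(text)
    return m.group("filename") if m else None
```

Because the filename pattern only applies after '#include', input like "a.b" elsewhere in the code is never treated as a filename.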

Also, I think your current grammar will cause issues: as soon as
ANTLR sees a '.' or '/' following a word, it will assume it must be a
FILENAME. This is because ANTLR does not look past the end of rules
when predicting alternatives. So, for instance, "a/3" will cause an
error because, upon seeing the '/', ANTLR will try to match a
filename, which will fail. To solve this you need to combine the WORD
and FILENAME rules like:
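
As a rough illustration of combining the two rules, here is a Python sketch (a regex, not ANTLR syntax, and only an approximation of what a combined lexer rule would do). The names TOKEN and classify are made up for illustration. The idea is to always match a word first, then optionally extend it to a filename, and to pick the token type by whether the extension matched:

```python
import re

# Same shape as WORD in the grammar: a letter, then letters, digits, '_' or '-'.
WORD = r"[A-Za-z][A-Za-z0-9_-]*"

# One pattern for both token kinds: a word, optionally followed by a
# filename tail (zero or more '/word' segments and a '.word' suffix).
TOKEN = re.compile(rf"{WORD}(?P<tail>(?:/{WORD})*\.{WORD})?")

def classify(text, pos=0):
    """Return (token_type, lexeme) for the token starting at pos, or None."""
    m = TOKEN.match(text, pos)
    if m is None:
        return None
    # The token is a FILENAME only if the optional tail actually matched;
    # otherwise the match falls back to a plain WORD.
    kind = "FILENAME" if m.group("tail") else "WORD"
    return kind, m.group(0)
```

With this shape, "a/3" yields the WORD "a" and leaves "/3" for other rules, instead of failing partway through a filename.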


> Regards,
> Bjoern
