[antlr-interest] Fundamental question on lexer rule ordering

Tue Feb 27 12:06:37 PST 2007

Hi!

I must be missing something fundamental on lexer rule ordering, because I 
keep running into the same problem over and over: re-ordering rules 
changes the lexer from "working" to "failing", and I don't understand why.

Note that this is not the same question as my previous posting on fragment rules.

Take this input text, which contains a multi-line comment:

int id;
int int_id;
int _int_id;
/*
  nothing
*/
45b32
6h87z

I have two lexers, one that works and one that fails. This one works:

lexer grammar DUMMY_Lexer;
INT        : 'int' ;
SEMI       : ';' ;
WS         : (  ' '| '\t'| '\r' | '\n' )+ {$channel=HIDDEN;} ;
IDENTIFIER : ('a'..'z'|'A'..'Z'|'_')+;
NUMBER     : DIGIT+ (BASE (DIGIT|'z'|'Z')+)? ;
ML_COMMENT : '/*' ( options {greedy=false;} : .)* '*/' {$channel=HIDDEN;} ;
fragment
BASE       : 'b' | 'h';
fragment
DIGIT      : '0'..'9';

This one does not work:

lexer grammar DUMMY_Lexer;
INT        : 'int' ;
SEMI       : ';' ;
WS         :  (  ' '| '\t'| '\r' | '\n' )+ {$channel=HIDDEN;} ;
ML_COMMENT : '/*' ( options {greedy=false;} : .)* '*/' {$channel=HIDDEN;} ;
IDENTIFIER : ('a'..'z'|'A'..'Z'|'_')+ ;
NUMBER     : DIGIT+ (BASE (DIGIT|'z'|'Z')+)? ;
fragment
BASE       : 'b' | 'h';
fragment
DIGIT      : '0'..'9';

The only difference is that ML_COMMENT is in a different position. I can 
picture a machine consuming characters and trying to match tokens, but all 
these tokens I'm lexing are very different and I don't understand how the 
order could possibly matter in this case.
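
For concreteness, the machine pictured above can be sketched as a naive 
longest-match scanner. This is only an illustration of that mental model 
in Python, not ANTLR's actual algorithm or its generated code. The regexes 
are hand transcriptions of the rules above (fragments inlined), and the 
names tokenize, RULES_A, RULES_B and SAMPLE are made up for the example.

```python
import re

# Hand-transcribed regex equivalents of the lexer rules (an assumption,
# not ANTLR output). RULES_A follows the order of the first grammar.
RULES_A = [
    ("INT", r"int"),
    ("SEMI", r";"),
    ("WS", r"[ \t\r\n]+"),
    ("IDENTIFIER", r"[a-zA-Z_]+"),
    ("NUMBER", r"[0-9]+(?:[bh][0-9zZ]+)?"),
    ("ML_COMMENT", r"/\*.*?\*/"),  # non-greedy, like greedy=false
]

# RULES_B follows the second grammar: ML_COMMENT moved ahead of IDENTIFIER.
RULES_B = RULES_A[:3] + [RULES_A[5]] + RULES_A[3:5]

# Tokens sent to the hidden channel in the grammars above.
HIDDEN = {"WS", "ML_COMMENT"}

SAMPLE = "int id;\nint int_id;\nint _int_id;\n/*\n  nothing\n*/\n45b32\n6h87z\n"


def tokenize(text, rules):
    """Try every rule at the current offset and keep the longest match.

    Rule order is only used to break ties between equally long matches.
    """
    compiled = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in rules]
    pos = 0
    tokens = []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' means an earlier rule wins when lengths are equal.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}: {text[pos]!r}")
        if best_name not in HIDDEN:
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    for name, text in tokenize(SAMPLE, RULES_A):
        print(name, repr(text))
    # Under this naive model the two rule orders give identical tokens.
    print(tokenize(SAMPLE, RULES_A) == tokenize(SAMPLE, RULES_B))
```

Under this model the position of ML_COMMENT cannot change the result, 
because no other rule can match at a '/' character. That is why the 
behaviour of the second lexer is so surprising to me.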

I'd really like to understand. I apologize if this is in the manual; I 
must have missed it.

Thanks!
Martin