[antlr-interest] Lexer fails

Thu Jan 26 23:25:29 PST 2012

At 14:27 27/01/2012, Peter Piper wrote:
 >I'm sorry that I can only talk about the old stuff (v3) but can
 >anyone see how the following lexer token definition:
 >
 >FLOAT : ('0'..'9')+ ( '.' ('0'..'9')* )? ('E' | 'e') ('-')?
 >('0'..'9')+ ;
[...]
 >
 >There is no 'e' or 'E' in the input, so why does the ANTLR lexer
 >think that this is a "better" token to output than the other one
 >I want it to go for, namely:
 >
 >FIXED : ('0'..'9')+ '.' ('0'..'9')* ;

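v3 lexers mostly use only a single character of lookahead when 
deciding whether to enter or continue a loop.  That isn't enough 
to tell these two rules apart, since both start with the same 
digits and '.' and only differ once the exponent shows up (or 
doesn't).  You need to help the lexer out by providing explicit 
lookahead hints, i.e. syntactic predicates.  (Reportedly v4 is 
much better at this, but I haven't tried it myself yet.)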

fragment FLOAT : ('0'..'9')+ ( '.' ('0'..'9')* )? ('E' | 'e') 
('-')? ('0'..'9')+;

FIXED : (FLOAT) => FLOAT { $type = FLOAT; }
       | ('0'..'9')+ '.' ('0'..'9')*
       ;
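
In case it helps to see what the predicate is doing, here is a 
rough Python analogue.  This is only a sketch, not what ANTLR 
generates: the two regexes are my own translation of the FLOAT and 
FIXED rules, and match_number is a made-up name.

```python
import re

# Assumed regex translations of the two grammar rules above.
FLOAT_RE = re.compile(r"[0-9]+(?:\.[0-9]*)?[Ee]-?[0-9]+")
FIXED_RE = re.compile(r"[0-9]+\.[0-9]*")


def match_number(text, pos=0):
    """Return (token_type, lexeme), or None if neither rule matches at pos."""
    # Analogue of "(FLOAT) => FLOAT": try the whole FLOAT rule first,
    # and only commit to it if the complete rule matches.
    m = FLOAT_RE.match(text, pos)
    if m:
        return ("FLOAT", m.group())
    # The speculative match failed, so fall back to the FIXED alternative.
    m = FIXED_RE.match(text, pos)
    if m:
        return ("FIXED", m.group())
    return None
```

The point is that the decision is made by attempting the full 
FLOAT rule rather than by peeking at one character, so input with 
no 'e' or 'E' in it falls through to FIXED.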

Or left-factor it for more efficiency (and throw an INTEGER in, 
since I assume you have one of those too):

fragment FLOAT : ;
fragment FIXED : ;

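INTEGER : ('0'..'9')+
         // Only take the '.' if a digit follows it.
         ( ('.' ('0'..'9')) => '.' ('0'..'9')* { $type = FIXED; }
           // An exponent after the fraction makes the token a FLOAT.
           ( ('E'|'e') '-'? ('0'..'9')+ { $type = FLOAT; } )? )?
         ;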
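
To make the control flow of the left-factored rule easier to 
follow, here is a hand-written Python scanner that mirrors it step 
by step.  Again just a sketch under my own assumptions: scan_number 
and DIGITS are made-up names, and where the grammar would hit a 
dangling 'e' with no digits after it, this version simply leaves 
the 'e' unconsumed (ANTLR itself may well report an error there).

```python
DIGITS = "0123456789"


def scan_number(text, pos=0):
    """Return (token_type, lexeme) for a number starting at pos, else None."""
    n = len(text)
    i = pos
    # ('0'..'9')+ : the common prefix shared by all three token types.
    while i < n and text[i] in DIGITS:
        i += 1
    if i == pos:
        return None
    kind = "INTEGER"
    # ('.' ('0'..'9')) => : only take the '.' if a digit follows it.
    if i + 1 < n and text[i] == "." and text[i + 1] in DIGITS:
        i += 1
        while i < n and text[i] in DIGITS:
            i += 1
        kind = "FIXED"
        # Optional exponent, nested inside the '.' branch as in the rule.
        j = i
        if j < n and text[j] in "eE":
            j += 1
            if j < n and text[j] == "-":
                j += 1
            k = j
            while k < n and text[k] in DIGITS:
                k += 1
            if k > j:  # at least one exponent digit was found
                i = k
                kind = "FLOAT"
    return (kind, text[pos:i])
```

Two things fall out of the nesting: "1." scans as INTEGER "1" 
because the predicate wants a digit after the '.', and "1e5" scans 
as INTEGER "1" because the exponent is only reachable after a 
fraction.  If you need either of those forms, the rule has to be 
adjusted accordingly.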

Or just call all of these things NUMBERs and sort it out in the 
parser. :)
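
For that last option, the "sorting out" can be as simple as looking 
at the matched text afterwards.  A minimal Python sketch, assuming 
a single NUMBER pattern that covers all three shapes (NUMBER_RE, 
classify and number_at are names I made up for illustration):

```python
import re

# One permissive pattern: digits, optional fraction, optional exponent.
NUMBER_RE = re.compile(r"[0-9]+(?:\.[0-9]*)?(?:[Ee]-?[0-9]+)?")


def classify(lexeme):
    """Decide the kind of number from the text that was already matched."""
    if "e" in lexeme.lower():
        return "FLOAT"
    if "." in lexeme:
        return "FIXED"
    return "INTEGER"


def number_at(text, pos=0):
    """Return (kind, lexeme) for a NUMBER starting at pos, else None."""
    m = NUMBER_RE.match(text, pos)
    if m is None:
        return None
    return (classify(m.group()), m.group())
```

The lexer then has nothing to disambiguate, and the classification 
happens once the whole lexeme is known.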