[antlr-interest] Problem with simple tokens

Sat Aug 23 03:54:51 PDT 2008

At 21:11 23/08/2008, Markus Stoeger wrote:
 >rule1: Foo ('.' | '!');
 >
 >Foo: 'foo';
 >Identifier: 'a'..'z'+ ('.' 'a'..'z'+)*;
 >--- CUT ---
 >
 >When running that in the debugger it matches "foo!" but not "foo.",
 >which causes a MismatchedTokenException.
 >
 >Why doesn't it match "foo."?
 >
 >It has something to do with the Identifier token (which contains
 >a dot) but I don't understand why.. note that to match as
 >Identifier the dot would have to be followed by at least one
 >letter, which isn't the case with "foo.". Also the token Foo
 >should have precedence over the token Identifier as it is
 >defined earlier.

As I explained earlier today, the ANTLR lexer looks ahead only as 
far as it thinks it needs to in order to disambiguate the 
alternatives -- and that's not always far enough to get it 
"right".

In this case, what's happening is that the input 'foo' could match 
either Foo or Identifier; on its own, ANTLR will choose Foo, since 
it's listed first. Given the input 'foo.', though, this could be 
either "Foo '.'" or the start of an Identifier (admittedly not a 
complete Identifier, but it doesn't realise that yet), so it'll 
pick Identifier, since that consumes more of the input in one go. 
By the time it finds no letter after the dot it has already 
committed, and it reports an error instead of backing up to try 
Foo.

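You can force ANTLR to use extra lookahead by adding a syntactic 
predicate, which checks for a dot followed by a letter before the 
loop is entered. It's slightly more verbose: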

Identifier: 'a'..'z'+ (('.' 'a'..'z') => '.' 'a'..'z'+)*;
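To make the difference concrete, here is a toy model in Python of the loop decision inside Identifier. It is only a sketch of the behaviour described above, not ANTLR's generated code: the names tokenize, match_identifier and LexError are made up for illustration, and the "Foo wins when the text is exactly 'foo'" step is a simplification of how ANTLR chooses between the two rules.

```python
class LexError(Exception):
    """Raised when the toy lexer has committed to a path that fails."""


def is_letter(ch):
    # Models the 'a'..'z' range from the grammar.
    return len(ch) == 1 and 'a' <= ch <= 'z'


def match_identifier(text, pos, use_predicate):
    """Return the end index of an Identifier-like match starting at pos."""
    def peek(i):
        return text[i] if i < len(text) else ''

    while is_letter(peek(pos)):          # 'a'..'z'+
        pos += 1
    while peek(pos) == '.':              # ('.' 'a'..'z'+)*
        # With the predicate, the loop is entered only if a letter
        # really follows the dot: ('.' 'a'..'z') => ...
        if use_predicate and not is_letter(peek(pos + 1)):
            break
        pos += 1                         # committed: consume '.'
        if not is_letter(peek(pos)):
            raise LexError("expected a letter after '.' at %d" % pos)
        while is_letter(peek(pos)):
            pos += 1
    return pos


def tokenize(text, use_predicate):
    """Split text into (type, lexeme) pairs using the toy rules."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if is_letter(ch):
            end = match_identifier(text, pos, use_predicate)
            lexeme = text[pos:end]
            # Simplification: Foo is listed first, so it wins on 'foo'.
            kind = 'Foo' if lexeme == 'foo' else 'Identifier'
            tokens.append((kind, lexeme))
            pos = end
        elif ch in '.!':
            tokens.append((ch, ch))
            pos += 1
        else:
            raise LexError("unexpected character %r at %d" % (ch, pos))
    return tokens
```

Calling tokenize("foo.", use_predicate=False) raises LexError, because the loop is entered on the dot alone; with use_predicate=True the same input comes out as a Foo token followed by a '.' token, while "foo.bar" is still a single Identifier.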