[antlr-interest] Lexer bug?

Sun Oct 21 20:02:24 PDT 2007

At 13:49 22/10/2007, Clifford Heath wrote:
 >This rule consumes digits and one ".", then stops - and that's not
 >a legal token.

I've been complaining off and on about similar cases since the 
early betas.  Some useful discussion came up a while back: the 
predefined "Tokens" rule is generated on the basis of matching 
only one token, and all the lookahead is generated from that same 
perspective; whereas if it were generated to match a sequence of 
tokens instead, it would generate better lookahead.

Consider the source rules again for a moment:
DOTTY : '..';
NUMBER
:       SIGN? DIGIT+ FRACTION? EXPONENT?
|       SIGN? FRACTION EXPONENT?
;
fragment SIGN:          ('+' | '-');
fragment FRACTION:      '.' DIGIT+;
fragment EXPONENT:      ('e'|'E') SIGN? DIGIT+;
fragment DIGIT  :       '0'..'9';

To enter this rule, there must be one of these things:
   1. a SIGN followed by at least one DIGIT
   2. a SIGN followed by a FRACTION (which is a dot and at least 
one DIGIT)
   3. at least one DIGIT
   4. a FRACTION (which is a dot and at least one DIGIT)
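
For what it's worth, those four entry conditions can be written 
down as a small check.  This is only a Python sketch of the list 
above, not anything ANTLR generates; the function and helper names 
are made up for illustration.

```python
DIGITS = "0123456789"


def can_start_number(text, pos=0):
    """Return True if a NUMBER could begin at text[pos].

    Mirrors the four cases: SIGN DIGIT, SIGN FRACTION, DIGIT, FRACTION.
    """
    def digit_at(i):
        return i < len(text) and text[i] in DIGITS

    def fraction_at(i):
        # A FRACTION is a dot followed by at least one DIGIT.
        return i < len(text) and text[i] == "." and digit_at(i + 1)

    if pos < len(text) and text[pos] in "+-":
        # Cases 1 and 2: a SIGN followed by a DIGIT or a FRACTION.
        return digit_at(pos + 1) or fraction_at(pos + 1)
    # Cases 3 and 4: a DIGIT or a FRACTION with no SIGN.
    return digit_at(pos) or fraction_at(pos)
```

Note that a lone ".." fails the check (the dot is not followed by 
a DIGIT), which is why DOTTY and NUMBER don't collide at the start 
of a token.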

As soon as you see any of those four cases, you can be reasonably 
sure that you've got yourself a NUMBER (unless you're also trying 
to produce longest-match-wins, in which case you have to entertain 
the possibility it's something else).

So, ok.  The input at this point is "10..20".  This starts with a 
sequence of digits, which matches option 3, so that's fine, it's a 
NUMBER.  So the lexer consumes all the digits and completes the 
first loop.  Everything is all good up to this point.

This is where things go wrong, though.  In reality, the things 
that ought to be permissible next are:
   1. end of token (everything following the DIGIT loop is 
optional, after all)
   2. a FRACTION (a dot and at least one DIGIT)
   3. an EXPONENT (starting with an e or E)
   4. a FRACTION followed by an EXPONENT

ANTLR currently looks ahead and sees a dot.  At this point, it 
immediately assumes that it's got path #2 and tries to match a 
FRACTION.  (Which it will eventually manage to do, but only by 
discarding some of the input stream, which is naughty.)
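
To make the failure concrete, here is a rough Python sketch of that 
decision: one character of lookahead after the DIGIT loop, and a 
dot is taken to mean FRACTION.  It is a simulation under my own 
assumptions, not ANTLR's generated code; it handles only 
DIGIT+ FRACTION? (no SIGN or EXPONENT), and it raises an error 
where ANTLR would instead attempt recovery.  LexError and 
match_number_committed are hypothetical names.

```python
DIGITS = "0123456789"


class LexError(Exception):
    """Raised when the committed path cannot be completed."""


def match_number_committed(text, pos=0):
    """Match DIGIT+ FRACTION?, committing to FRACTION on seeing a dot."""
    start = pos
    while pos < len(text) and text[pos] in DIGITS:
        pos += 1
    if pos == start:
        raise LexError(f"expected DIGIT at offset {pos}")

    # One character of lookahead: a dot is assumed to start a FRACTION.
    if pos < len(text) and text[pos] == ".":
        pos += 1  # the dot is consumed; there is no way back from here
        if pos >= len(text) or text[pos] not in DIGITS:
            raise LexError(f"expected DIGIT at offset {pos}")
        while pos < len(text) and text[pos] in DIGITS:
            pos += 1

    return text[start:pos], pos
```

With "10.5" this returns the whole token, but with "10..20" it 
consumes "10." and then fails at the second dot, even though "10" 
on its own was a perfectly good NUMBER.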

The "correct" answer should have been #1, to end the token at that 
point (because it is perfectly valid, after all).  After doing 
that, it would then see two dots and generate a DOTTY, then 
another sequence of DIGITs and generate another NUMBER.  Which is 
exactly what is desired here.
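
As a sketch of the behaviour I'd expect, here is the same idea in 
Python with the optional FRACTION taken only when the dot is 
actually followed by a DIGIT.  Again this is an illustration under 
my own assumptions, not how ANTLR is implemented: it covers only 
DOTTY and unsigned NUMBERs without EXPONENT, and tokenize and 
is_digit are made-up names.

```python
DIGITS = "0123456789"


def is_digit(text, i):
    return i < len(text) and text[i] in DIGITS


def tokenize(text):
    """Split text into (type, lexeme) pairs for DOTTY and NUMBER."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text.startswith("..", pos):
            tokens.append(("DOTTY", ".."))
            pos += 2
            continue

        start = pos
        if is_digit(text, pos):
            while is_digit(text, pos):
                pos += 1
            # Enter the optional FRACTION only if it can be completed:
            # a dot AND a digit right after it.  Otherwise end the token.
            if pos < len(text) and text[pos] == "." and is_digit(text, pos + 1):
                pos += 1
                while is_digit(text, pos):
                    pos += 1
        elif text[pos] == "." and is_digit(text, pos + 1):
            # NUMBER that starts with a FRACTION, e.g. ".5"
            pos += 1
            while is_digit(text, pos):
                pos += 1
        else:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(("NUMBER", text[start:pos]))
    return tokens
```

Here tokenize("10..20") gives NUMBER "10", DOTTY "..", NUMBER "20", 
and "1.5..2" still comes out as NUMBER "1.5", DOTTY, NUMBER "2".  
The only difference from the committed version is one extra 
character of lookahead before entering the optional clause.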

So the assumption it made when seeing the dot was faulty.  It was 
right that, once the dot is consumed, the only thing that can 
legally follow (within that same token) is a DIGIT loop, but it 
completely ignored the fact that it wasn't required to consume the 
dot at all, since the dot is part of an optional clause.

Whenever this comes up (and it does come up a lot), the answer 
from Ter and Jim always seems to be "it's supposed to do that", 
and "rewrite your lexer rules" (usually involving syntactic 
predicates in some way).  My point is that the current behaviour 
is completely counter-intuitive (and hence "wrong", from my 
viewpoint), and while rewriting the lexer rules can often work 
around it, we shouldn't have to.  (And it's a lot messier.)