[antlr-interest] Lexer bug?
Gavin Lambert
antlr at mirality.co.nz
Sun Oct 21 20:02:24 PDT 2007
At 13:49 22/10/2007, Clifford Heath wrote:
>This rule consumes digits and one ".", then stops - and that's not
>a legal token.
I've been complaining off and on about similar cases since the
early betas. Some useful discussion came up a while back: the
predefined "Tokens" rule is generated on the basis of matching
only a single token, and all the lookahead is computed from that
same perspective; if it were instead generated to match a
sequence of tokens, it would produce better lookahead.
Consider the source rules again for a moment:
DOTTY : '..';
NUMBER
    : SIGN? DIGIT+ FRACTION? EXPONENT?
    | SIGN? FRACTION EXPONENT?
    ;
fragment SIGN: ('+' | '-');
fragment FRACTION: '.' DIGIT+;
fragment EXPONENT: ('e'|'E') SIGN? DIGIT+;
fragment DIGIT : '0'..'9';
To enter the NUMBER rule, the input must begin with one of these:
1. a SIGN followed by at least one DIGIT
2. a SIGN followed by a FRACTION (which is a dot and at least
one DIGIT)
3. at least one DIGIT
4. a FRACTION (which is a dot and at least one DIGIT)
As soon as you see any of those four cases, you can be reasonably
sure that you've got yourself a NUMBER (unless you're also trying
to produce longest-match-wins, in which case you have to entertain
the possibility it's something else).
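The four entry cases above can be sketched as a small Python check. This is only an illustration of the reasoning, not anything ANTLR generates; the function name can_start_number is made up for this example.

```python
DIGITS = "0123456789"

def can_start_number(text, pos=0):
    """Return True if a NUMBER token may begin at text[pos].

    Hypothetical helper mirroring the four entry cases; not ANTLR output.
    """
    i = pos
    if i < len(text) and text[i] in "+-":      # optional SIGN
        i += 1
    if i < len(text) and text[i] in DIGITS:    # cases 1 and 3: a DIGIT
        return True
    # cases 2 and 4: a FRACTION, i.e. a dot followed by a DIGIT
    return (i + 1 < len(text)
            and text[i] == "."
            and text[i + 1] in DIGITS)
```

Note that the FRACTION cases need two characters of lookahead: a dot alone is not enough to say a NUMBER starts here.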
So, ok. The input at this point is "10..20". This starts with a
sequence of digits, which matches option 3, so that's fine, it's a
NUMBER. So the lexer consumes all the digits and completes the
first loop. Everything is all good up to this point.
This is where things go wrong, though. In reality, the things
that ought to be permissible next are:
1. end of token (everything following the DIGIT loop is
optional, after all)
2. a FRACTION (a dot and at least one DIGIT)
3. an EXPONENT (starting with an e or E)
4. a FRACTION followed by an EXPONENT
ANTLR currently looks ahead and sees a dot. At this point, it
immediately assumes that it's got path #2 and tries to match a
FRACTION. (Which it will eventually manage to do, but only by
discarding some of the input stream, which is naughty.)
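To make the faulty decision concrete, here is a rough Python model of a matcher that decides each optional clause from a single character of lookahead. It is a sketch of the behaviour described above, not the code ANTLR actually generates; all names are invented, and ANTLR's error recovery (the part that discards input) is modelled simply as an exception.

```python
DIGITS = "0123456789"

class LexError(Exception):
    """Raised when the matcher has committed to a path the input cannot satisfy."""

def match_digits(text, i):
    """Match DIGIT+ starting at offset i and return the offset after it."""
    if not (i < len(text) and text[i] in DIGITS):
        raise LexError(f"expected DIGIT at offset {i}, found {text[i:i + 1]!r}")
    while i < len(text) and text[i] in DIGITS:
        i += 1
    return i

def match_number_one_char_lookahead(text, pos=0):
    """Match NUMBER at pos, choosing optional parts from one character only."""
    i = pos
    if i < len(text) and text[i] in "+-":      # SIGN?
        i += 1
    if i < len(text) and text[i] in DIGITS:    # DIGIT+ (first alternative)
        i = match_digits(text, i)
    elif not (i < len(text) and text[i] == "."):
        raise LexError(f"no NUMBER at offset {pos}")
    # FRACTION? decided from ONE character: seeing a dot commits us to
    # FRACTION, and the dot is consumed before we know a DIGIT follows.
    if i < len(text) and text[i] == ".":
        i = match_digits(text, i + 1)
    # EXPONENT?, decided the same way
    if i < len(text) and text[i] in "eE":
        i += 1
        if i < len(text) and text[i] in "+-":
            i += 1
        i = match_digits(text, i)
    return i
```

On "10.5" this returns 4, as expected. On "10..20" it consumes "10" and the first dot, then raises LexError at the second dot, even though stopping after "10" would have been perfectly legal.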
The "correct" answer should have been #1, to end the token at that
point (because it is perfectly valid, after all). After doing
that, it would then see two dots and generate a DOTTY, then
another sequence of DIGITs and generate another NUMBER. Which is
exactly what is desired here.
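For comparison, the desired token stream falls out naturally from a tokenizer in which an optional clause is taken only if it matches in full. The following is a minimal sketch using Python's re module, with the two rules transcribed by hand into regular expressions; it illustrates the expected result and is not a description of how ANTLR works or should be implemented.

```python
import re

# NUMBER transcribed from the grammar: the FRACTION and EXPONENT groups
# are optional, and the regex engine skips them unless they match entirely.
NUMBER = r"[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?"
TOKEN = re.compile(rf"(?P<NUMBER>{NUMBER})|(?P<DOTTY>\.\.)")

def tokenize(text):
    """Split text into (token_type, lexeme) pairs, or raise ValueError."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"no token at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("10..20"))
# [('NUMBER', '10'), ('DOTTY', '..'), ('NUMBER', '20')]
```

After the digits "10", the engine tries the fraction group, finds a dot that is not followed by a digit, and falls back to ending the token there.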
So the assumption it made when seeing the dot was faulty. It was
correct in that after seeing the dot the only thing that could
legally follow it would be a DIGIT loop (within that same token),
but it completely ignored the fact that it wasn't required to
consume the dot at all, since it was part of an optional clause.
Whenever this comes up (and it does come up a lot), the answer
from Ter and Jim always seems to be "it's supposed to do that",
and "rewrite your lexer rules" (usually involving syntactic
predicates in some way). My point is that the current behaviour
is completely counter-intuitive (and hence "wrong", from my
viewpoint), and while rewriting the lexer rules can often work
around it, we shouldn't have to. (And it's a lot messier.)