[antlr-interest] Lexer bug?

Mon Oct 22 07:55:58 PDT 2007

On 10/21/07, Gavin Lambert <antlr at mirality.co.nz> wrote:
> At 13:49 22/10/2007, Clifford Heath wrote:
>  >This rule consumes digits and one ".", then stops - and that's not
>  >a legal token.
>
> I've been complaining off and on about similar cases since the
> early betas.  Some useful discussion came up a while back that the
> predefined "Tokens" rule was being generated on the basis of
> matching only one token, and all the lookahead is generated from
> that same perspective; whereas if it were generated to match a
> sequence of tokens instead, it would generate better lookahead.
>
> <snip>
>
> Whenever this comes up (and it does come up a lot), the answer
> from Ter and Jim always seems to be "it's supposed to do that",
> and "rewrite your lexer rules" (usually involving syntactic
> predicates in some way).  My point is that the current behaviour
> is completely counter-intuitive (and hence "wrong", from my
> viewpoint), and while rewriting the lexer rules can often work
> around it, we shouldn't have to.  (And it's a lot messier.)
>

I have to say I agree with Gavin in this case. I'm new to ANTLR 3, but
rules that I expected to just work often require a lot of tomfoolery
to match what I'm expecting. I've never used ANTLR 2, but I have
been handling the same kinds of cases in Ragel, where this falls out
naturally (it generates an FSM for the input, so you just have to
look at the last final state reached when the next char isn't
accepted).
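
To make the "last final state" idea concrete, here is a minimal sketch in
Python (not Ragel or ANTLR output). The DFA, the state names and the
INT/FLOAT/DOT token types are all made up for illustration; the only point
is that the scanner remembers the last accepting state it passed through
and rewinds to it when the next char isn't accepted.

```python
# Hypothetical hand-written DFA (an assumption for illustration only):
#   INT   : digit+
#   FLOAT : digit+ '.' digit+
#   DOT   : '.'
DIGITS = "0123456789"

# Final (accepting) states and the token type each one yields.
ACCEPTING = {"int": "INT", "frac": "FLOAT", "dot": "DOT"}


def step(state, ch):
    """Return the next DFA state, or None if there is no transition."""
    if state == "start":
        if ch in DIGITS:
            return "int"
        if ch == ".":
            return "dot"
    elif state == "int":
        if ch in DIGITS:
            return "int"
        if ch == ".":
            return "int_dot"  # not final: digits plus one "." is no token
    elif state in ("int_dot", "frac"):
        if ch in DIGITS:
            return "frac"
    return None


def tokenize(text):
    """Longest-match scan that falls back to the last final state seen."""
    tokens = []
    pos = 0
    while pos < len(text):
        state = "start"
        last_accept = None  # (end index, token type) of last final state
        i = pos
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break  # next char isn't accepted
            i += 1
            if state in ACCEPTING:
                last_accept = (i, ACCEPTING[state])
        if last_accept is None:
            raise ValueError(f"no token matches at position {pos}")
        end, kind = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end  # rewind: chars read past the final state are rescanned
    return tokens
```

With this sketch, tokenize("12.") consumes the digits and the ".", finds no
final state there, and backs up to give INT "12" followed by DOT ".",
instead of stopping on something that isn't a legal token.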

Kenny