[antlr-interest] Lexer lookahead overoptimizes
Gavin Lambert
antlr at mirality.co.nz
Thu Apr 12 14:54:09 PDT 2007
At 09:40 13/04/2007, shmuel siegel wrote:
>Terence Parr wrote:
>> Yep, ANTLR is doing what I would expect; just not would you
want.
>>;)
>>
>> ANTLR decides that upon '\u00d7' '\u00a9' it should predict
SHIN
>> and upon '\u00d7' '\u00aa' it should predict TUF. Once inside
>> SHIN, ANTLR cannot predict which token will come next. That's
>> not something lexers specify (parsers do that). ANY token can
>> follow. So, ANTLR matches greedily. You need a backtracking
>> parser or maybe try k=2 in the second subrule.
>>
>That leaves me confused as to what "?" means on a sequence. I
>thought that it means optional sequence but find that here it
>only means optional if the beginning of the sequence is not
>matched. It surprises me that it is possible for an optional
>sequence to throw a recognition exception. Why is the '\u00d7'
>optional but not the full sequence?
Try removing any k= options from your lexer; you shouldn't need
them in v3 since by default it's LL(*), which should work this
sort of thing out automatically. (At least as I understand it.)
With the input '\u00d7' '\u00a9' '\u00d7' '\u00aa', ANTLR should
be predicting SHIN based on the first two characters, then (still
within SHIN), test *both* the following characters and see they
don't match either of the optional clauses in SHIN, so should
generate the token and exit SHIN. Then the next call should pick
up those unconsumed characters as a TUF, which is what you'd
expect.
If you've set k=1, though, that'll prevent it from doing
sufficient lookahead to resolve this case, so you'll get the sort
of problems you're describing.
(I don't know whether ANTLR is actually working this way or not,
of course -- if it still doesn't work when you haven't specified
any k= options then I would consider it a bug.)
More information about the antlr-interest
mailing list