[antlr-interest] Lexer lookahead overoptimizes
shmuel siegel
antlr at shmuelhome.mine.nu
Thu Apr 12 14:40:56 PDT 2007
Terence Parr wrote:
>
> On Apr 12, 2007, at 2:58 AM, Shmuel Siegel wrote:
>> Among other rules, I have these two lexer rules.
>> SHIN:
>> '\u00d7' '\u00a9' ('\u00d7' '\u0081')? ('\u00d7' '\u0082')?;
>> TUF:
>> '\u00d7' '\u00aa';
>>
>> The code produced for "SHIN" properly recognizes that the optional first
>>
>> parenthesis needs two terms to match but the second parenthesis will try
>>
>> to match if the first character matches. Therefore I get a recognition
>> exception from the sequence '\u00d7' '\u00a9' '\u00d7' '\u00aa'.
>
> Yep, ANTLR is doing what I would expect; just not would you want. ;)
>
> ANTLR decides that upon '\u00d7' '\u00a9' it should predict SHIN and
> upon '\u00d7' '\u00aa' it should predict TUF. Once inside SHIN, ANTLR
> cannot predict which token will come next. That's not something
> lexers specify (parsers do that). ANY token can follow. So, ANTLR
> matches greedily. You need a backtracking parser or maybe try k=2 in
> the second subrule.
>
> Ter
>
>
That leaves me confused as to what "?" means on a sequence. I thought
that it means optional sequence but find that here it only means
optional if the beginning of the sequence is not matched. It surprises
me that it is possible for an optional sequence to throw a recognition
exception. Why is the '\u00d7' optional but not the full sequence?
More information about the antlr-interest
mailing list