[antlr-interest] Lexer lookahead overoptimizes

Thu Apr 12 14:40:56 PDT 2007

Terence Parr wrote:
>
> On Apr 12, 2007, at 2:58 AM, Shmuel Siegel wrote:
>> Among other rules, I have these two lexer rules.
>>  SHIN:
>>     '\u00d7' '\u00a9' ('\u00d7' '\u0081')? ('\u00d7' '\u0082')?;
>>  TUF:
>>     '\u00d7' '\u00aa';
>>
>> The code produced for "SHIN" properly recognizes that the optional first
>>
>> parenthesis needs two terms to match but the second parenthesis will try
>>
>> to match if the first character matches. Therefore I get a recognition
>> exception from the sequence '\u00d7' '\u00a9' '\u00d7' '\u00aa'.
>
> Yep, ANTLR is doing what I would expect; just not would you want. ;)
>
> ANTLR decides that upon '\u00d7' '\u00a9' it should predict SHIN and 
> upon '\u00d7' '\u00aa' it should predict TUF.  Once inside SHIN, ANTLR 
> cannot predict which token will come next.  That's not something 
> lexers specify (parsers do that).  ANY token can follow.  So, ANTLR 
> matches greedily.  You need a backtracking parser or maybe try k=2 in 
> the second subrule.
>
> Ter
>
>
That leaves me confused as to what "?" means on a sequence. I thought 
that it means optional sequence but find that here it only means 
optional if the beginning of the sequence is not matched. It surprises 
me that it is possible for an optional sequence to throw a recognition 
exception. Why is the '\u00d7' optional but not the full sequence?