[antlr-interest] Lexer lookahead overoptimizes

Thu Apr 12 12:09:13 PDT 2007

On Apr 12, 2007, at 2:58 AM, Shmuel Siegel wrote:
> Among other rules, I have these two lexer rules.
>  SHIN:
>     '\u00d7' '\u00a9' ('\u00d7' '\u0081')? ('\u00d7' '\u0082')?;
>  TUF:
>     '\u00d7' '\u00aa';
>
> The code produced for "SHIN" properly recognizes that the optional  
> first
>
> parenthesis needs two terms to match but the second parenthesis  
> will try
>
> to match if the first character matches. Therefore I get a recognition
> exception from the sequence '\u00d7' '\u00a9' '\u00d7' '\u00aa'.

Yep, ANTLR is doing what I would expect; just not would you want. ;)

ANTLR decides that upon '\u00d7' '\u00a9' it should predict SHIN and  
upon '\u00d7' '\u00aa' it should predict TUF.  Once inside SHIN,  
ANTLR cannot predict which token will come next.  That's not  
something lexers specify (parsers do that).  ANY token can follow.   
So, ANTLR matches greedily.  You need a backtracking parser or maybe  
try k=2 in the second subrule.

Ter