[antlr-interest] Lexer lookahead overoptimizes

Thu Apr 12 14:54:09 PDT 2007

At 09:40 13/04/2007, shmuel siegel wrote:
 >Terence Parr wrote:
 >> Yep, ANTLR is doing what I would expect; just not would you 
want.
 >>;)
 >>
 >> ANTLR decides that upon '\u00d7' '\u00a9' it should predict 
SHIN
 >> and upon '\u00d7' '\u00aa' it should predict TUF.  Once inside 

 >> SHIN, ANTLR cannot predict which token will come next.  That's 

 >> not something lexers specify (parsers do that).  ANY token can 

 >> follow.  So, ANTLR matches greedily.  You need a backtracking
 >> parser or maybe try k=2 in the second subrule.
 >>
 >That leaves me confused as to what "?" means on a sequence. I
 >thought that it means optional sequence but find that here it
 >only means optional if the beginning of the sequence is not
 >matched. It surprises me that it is possible for an optional
 >sequence to throw a recognition exception. Why is the '\u00d7'
 >optional but not the full sequence?

Try removing any k= options from your lexer; you shouldn't need 
them in v3 since by default it's LL(*), which should work this 
sort of thing out automatically.  (At least as I understand it.)

With the input '\u00d7' '\u00a9' '\u00d7' '\u00aa', ANTLR should 
be predicting SHIN based on the first two characters, then (still 
within SHIN), test *both* the following characters and see they 
don't match either of the optional clauses in SHIN, so should 
generate the token and exit SHIN.  Then the next call should pick 
up those unconsumed characters as a TUF, which is what you'd 
expect.

If you've set k=1, though, that'll prevent it from doing 
sufficient lookahead to resolve this case, so you'll get the sort 
of problems you're describing.

(I don't know whether ANTLR is actually working this way or not, 
of course -- if it still doesn't work when you haven't specified 
any k= options then I would consider it a bug.)