[antlr-interest] Lexer bug? (with test cases!)

Tue Oct 23 18:15:22 PDT 2007

On Oct 23, 2007, at 8:16 PM, Gavin Lambert wrote:
> Just to follow up on this, I ran a few tests just now (against  
> 3.0.1) and inlining it doesn't help.  Here's a minimal reproduction  
> grammar illustrating the problem:
>
> lexer grammar test;
> NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
> OTHER: .;
>
> Given the input sequence "10..30", the lexer *should* produce  
> "NUMBER[10] OTHER[.] OTHER[.] NUMBER[30]", but it doesn't.

Unfortunately for this situation, that is as I designed it; please  
see my faq entry on how to solve a similar issue with range  
operator.  ANTLR does not automatically backtrack in the lexer like  
lex and other automata based lexers do.  Backtracking a DFA is  
required to match what you want.  ANTLR simply predicts which rule  
will win and proceeds with an LL parse. '.' matches anything so is  
ambig with NUMBER.  In lexers, ANTLR assumes you have prioritized the  
rules in order so it hushes the warning and makes any digit predict  
number.  This is absolutely consistent with ANTLR parsers and tree  
parsers.  Done by same analyzer and code generator.  Please examine  
the following parser rule:

test	:	DIGIT+ ('.' DIGIT+)?
	|	.
	;

ANTLR cannot possibly decide which alt to choose upon 0..9, right?   
It matches both alts.  ANTLR will complain.

I just love the discussion I see previously about how ANTLR is broken  
and doesn't do the right thing etc...  Trust me folks, I've been  
doing lexers, state machines etc... for a long time with the obsessed  
focus of a lunatic. You may not want antlr to do what it's doing in  
some cases, but I designed ANTLR to do precisely what it is doing in  
this case.  Not to say there aren't any bugs in other situations ;)   
There *is* a nasty analysis bug I have to track down at some point.   
Can't narrow it down yet.

> Anyway, looking at this it's clear to see that it examines only one  
> character of lookahead and basically decides that if there's a dot  
> then the entire optional clause must be present

correct by how LL lexers (unlike lex lexers) work.

> -- despite the "minimal satisfying input" for that clause being a  
> dot followed by at least one digit.  So its lookahead is clearly  
> insufficient for the task.  (It would have been right if that were  
> a * or ? instead of a +, though.)

Correct.  ANTLR would normally warn you about this problem, but in  
lexers it chooses not to.

Remember folks: you can't just list a bunch of lexer rules that make  
sense to a human and expect ANTLR to "make it so".  Same is true for  
parsers, right?  The confusion arises since antlr doesn't warn you.

Ter