[antlr-interest] Lexer bug? (with test cases!)

Tue Oct 23 21:13:43 PDT 2007

----- Original Message ----
> From: Terence Parr <parrt at cs.usfca.edu>
> To: Gavin Lambert <antlr at mirality.co.nz>
> Cc: antlr-interest at antlr.org
> Sent: Tuesday, October 23, 2007 6:15:22 PM
> Subject: Re: [antlr-interest] Lexer bug? (with test cases!)
> 
> 
> On Oct 23, 2007, at 8:16 PM, Gavin Lambert wrote:
> > Just to follow up on this, I ran a few tests just now (against  
> > 3.0.1) and inlining it doesn't help.  Here's a minimal reproduction  
> > grammar illustrating the problem:
> >
> > lexer grammar test;
> > NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
> > OTHER: .;
> >
> > Given the input sequence "10..30", the lexer *should* produce  
> > "NUMBER[10] OTHER[.] OTHER[.] NUMBER[30]", but it doesn't.
> 
> Unfortunately for this situation, that is as I designed it; please  
> see my faq entry on how to solve a similar issue with range  
> operator.  ANTLR does not automatically backtrack in the lexer like  
> lex and other automata based lexers do.  Backtracking a DFA is  
> required to match what you want.  ANTLR simply predicts which rule  
> will win and proceeds with an LL parse. '.' matches anything so is 

Ter--

Take another look.  The '.' in the posted grammar is the character '.', not a wildcard; there is no ambiguity, just an LL(2) decision.  Unfortunately, the generated code makes an LL(1) decision and generates runtime errors as a result.  This is not a backtracking problem; note the selected workaround--it avoids having an epsilon alternative, but depends on k>1.

--Loring

 > ambig with NUMBER.  In lexers, ANTLR assumes you have prioritized the  
> rules in order so it hushes the warning and makes any digit predict  
> number.  This is absolutely consistent with ANTLR parsers and tree  
> parsers.  Done by same analyzer and code generator.  Please examine  
> the following parser rule:
> 
> test    :    DIGIT+ ('.' DIGIT+)?
>     |    .
>     ;
> 
> ANTLR cannot possibly decide which alt to choose upon 0..9, right?   
> It matches both alts.  ANTLR will complain.
> 
> I just love the discussion I see previously about how ANTLR is broken  
> and doesn't do the right thing etc...  Trust me folks, I've been  
> doing lexers, state machines etc... for a long time with the obsessed  
> focus of a lunatic. You may not want antlr to do what it's doing in  
> some cases, but I designed ANTLR to do precisely what it is doing in  
> this case.  Not to say there aren't any bugs in other situations ;)   
> There *is* a nasty analysis bug I have to track down at some point.   
> Can't narrow it down yet.
> 
> > Anyway, looking at this it's clear to see that it examines only one  
> > character of lookahead and basically decides that if there's a dot  
> > then the entire optional clause must be present
> 
> correct by how LL lexers (unlike lex lexers) work.
> 
> > -- despite the "minimal satisfying input" for that clause being a  
> > dot followed by at least one digit.  So its lookahead is clearly  
> > insufficient for the task.  (It would have been right if that were  
> > a * or ? instead of a +, though.)
> 
> Correct.  ANTLR would normally warn you about this problem, but in  
> lexers it chooses not to.
> 
> Remember folks: you can't just list a bunch of lexer rules that make  
> sense to a human and expect ANTLR to "make it so".  Same is true for  
> parsers, right?  The confusion arises since antlr doesn't warn you.
> 
> Ter
> 

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com