[antlr-interest] Lexer bug? (with test cases!)
Loring Craymer
lgcraymer at yahoo.com
Tue Oct 23 21:13:43 PDT 2007
----- Original Message ----
> From: Terence Parr <parrt at cs.usfca.edu>
> To: Gavin Lambert <antlr at mirality.co.nz>
> Cc: antlr-interest at antlr.org
> Sent: Tuesday, October 23, 2007 6:15:22 PM
> Subject: Re: [antlr-interest] Lexer bug? (with test cases!)
>
>
> On Oct 23, 2007, at 8:16 PM, Gavin Lambert wrote:
> > Just to follow up on this, I ran a few tests just now (against
> > 3.0.1) and inlining it doesn't help. Here's a minimal reproduction
> > grammar illustrating the problem:
> >
> > lexer grammar test;
> > NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
> > OTHER: .;
> >
> > Given the input sequence "10..30", the lexer *should* produce
> > "NUMBER[10] OTHER[.] OTHER[.] NUMBER[30]", but it doesn't.
>
> Unfortunately for this situation, that is as I designed it; please
> see my faq entry on how to solve a similar issue with range
> operator. ANTLR does not automatically backtrack in the lexer like
> lex and other automata based lexers do. Backtracking a DFA is
> required to match what you want. ANTLR simply predicts which rule
> will win and proceeds with an LL parse. '.' matches anything so is
Ter--
Take another look. The '.' in the posted grammar is the character '.', not a wildcard; there is no ambiguity, just an LL(2) decision. Unfortunately, the generated code makes an LL(1) decision and generates runtime errors as a result. This is not a backtracking problem; note the selected workaround--it avoids having an epsilon alternative, but depends on k>1.
--Loring
> ambig with NUMBER. In lexers, ANTLR assumes you have prioritized the
> rules in order so it hushes the warning and makes any digit predict
> number. This is absolutely consistent with ANTLR parsers and tree
> parsers. Done by same analyzer and code generator. Please examine
> the following parser rule:
>
> test : DIGIT+ ('.' DIGIT+)?
> | .
> ;
>
> ANTLR cannot possibly decide which alt to choose upon 0..9, right?
> It matches both alts. ANTLR will complain.
>
> I just love the discussion I see previously about how ANTLR is broken
> and doesn't do the right thing etc... Trust me folks, I've been
> doing lexers, state machines etc... for a long time with the obsessed
> focus of a lunatic. You may not want antlr to do what it's doing in
> some cases, but I designed ANTLR to do precisely what it is doing in
> this case. Not to say there aren't any bugs in other situations ;)
> There *is* a nasty analysis bug I have to track down at some point.
> Can't narrow it down yet.
>
> > Anyway, looking at this it's clear to see that it examines only one
> > character of lookahead and basically decides that if there's a dot
> > then the entire optional clause must be present
>
> correct by how LL lexers (unlike lex lexers) work.
>
> > -- despite the "minimal satisfying input" for that clause being a
> > dot followed by at least one digit. So its lookahead is clearly
> > insufficient for the task. (It would have been right if that were
> > a * or ? instead of a +, though.)
>
> Correct. ANTLR would normally warn you about this problem, but in
> lexers it chooses not to.
>
> Remember folks: you can't just list a bunch of lexer rules that make
> sense to a human and expect ANTLR to "make it so". Same is true for
> parsers, right? The confusion arises since antlr doesn't warn you.
>
> Ter
>
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
More information about the antlr-interest
mailing list