[antlr-interest] Lexer bug? (with test cases!)
Terence Parr
parrt at cs.usfca.edu
Tue Oct 23 18:15:22 PDT 2007
On Oct 23, 2007, at 8:16 PM, Gavin Lambert wrote:
> Just to follow up on this, I ran a few tests just now (against
> 3.0.1) and inlining it doesn't help. Here's a minimal reproduction
> grammar illustrating the problem:
>
> lexer grammar test;
> NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
> OTHER: .;
>
> Given the input sequence "10..30", the lexer *should* produce
> "NUMBER[10] OTHER[.] OTHER[.] NUMBER[30]", but it doesn't.
Unfortunately for this situation, that is as I designed it; please
see my faq entry on how to solve a similar issue with range
operator. ANTLR does not automatically backtrack in the lexer like
lex and other automata based lexers do. Backtracking a DFA is
required to match what you want. ANTLR simply predicts which rule
will win and proceeds with an LL parse. '.' matches anything so is
ambig with NUMBER. In lexers, ANTLR assumes you have prioritized the
rules in order so it hushes the warning and makes any digit predict
number. This is absolutely consistent with ANTLR parsers and tree
parsers. Done by same analyzer and code generator. Please examine
the following parser rule:
test : DIGIT+ ('.' DIGIT+)?
| .
;
ANTLR cannot possibly decide which alt to choose upon 0..9, right?
It matches both alts. ANTLR will complain.
I just love the discussion I see previously about how ANTLR is broken
and doesn't do the right thing etc... Trust me folks, I've been
doing lexers, state machines etc... for a long time with the obsessed
focus of a lunatic. You may not want antlr to do what it's doing in
some cases, but I designed ANTLR to do precisely what it is doing in
this case. Not to say there aren't any bugs in other situations ;)
There *is* a nasty analysis bug I have to track down at some point.
Can't narrow it down yet.
> Anyway, looking at this it's clear to see that it examines only one
> character of lookahead and basically decides that if there's a dot
> then the entire optional clause must be present
correct by how LL lexers (unlike lex lexers) work.
> -- despite the "minimal satisfying input" for that clause being a
> dot followed by at least one digit. So its lookahead is clearly
> insufficient for the task. (It would have been right if that were
> a * or ? instead of a +, though.)
Correct. ANTLR would normally warn you about this problem, but in
lexers it chooses not to.
Remember folks: you can't just list a bunch of lexer rules that make
sense to a human and expect ANTLR to "make it so". Same is true for
parsers, right? The confusion arises since antlr doesn't warn you.
Ter
More information about the antlr-interest
mailing list