[antlr-interest] Lexer bug? (with test cases!)

Tue Oct 23 05:16:36 PDT 2007

At 19:21 22/10/2007, Loring Craymer wrote:
 >Um--I don't think that this is quite right.  ANTLR 3 has an
 >inelegant tendency to make k=1 decisions when it should not.
 >Specifically:  any time there is an epsilon alternative--as in
 >FRACTION?--ANTLR tends to make a k=1 decision, as in "I see a 
'.';
 >therefore, this is a FRACTION" in Austin's NUMBER rule.  From my 

 >perspective, this is probably a bug in the LL* implementation: a 

 >lookahead DFA should be generated for such cases (to replace the 

 >"if (LA(1) == '.') mFRACTION()") that does the right thing.
[...]
 >If the FRACTION rule is inlined, ANTLR 3 will probably do the 
right
 >thing (I have not tested this example, but have had to resort to 

 >inlining in other cases).  Again, this is indicative that Austin 
is
 >correct in his assertion that this is a bug:  there should be no 

 >difference between rule invocations and the equivalent inlined
 >token or character sequences.

Just to follow up on this, I ran a few tests just now (against 
3.0.1) and inlining it doesn't help.  Here's a minimal 
reproduction grammar illustrating the problem:

lexer grammar test;
NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
OTHER: .;

Given the input sequence "10..30", the lexer *should* produce 
"NUMBER[10] OTHER[.] OTHER[.] NUMBER[30]", but it doesn't.  (I'm 
not quite sure what it does do, since there aren't really any 
facilities at the moment for debugging lexers short of writing 
some custom code, and I didn't have that much spare time.)

Looking at the generated code for this, right after matching the 
first set of digits it has this code:

             // test.g:4:16: ( '.' ( '0' .. '9' )+ )?
             int alt3=2;
             int LA3_0 = input.LA(1);

             if ( (LA3_0=='.') ) {
                 alt3=1;
             }
             switch (alt3) {
                 case 1 :
                     // 
D:\\Programming\\antlr\\tests\\test4.g:4:17: '.' ( '0' .. '9' )+
                     {
                     match('.');
                     // 
D:\\Programming\\antlr\\tests\\test4.g:4:21: ( '0' .. '9' )+
                     int cnt2=0;
                     loop2:
                     do {
                         int alt2=2;
                         int LA2_0 = input.LA(1);

                         if ( ((LA2_0>='0' && LA2_0<='9')) ) {
                             alt2=1;
                         }

                         switch (alt2) {
                     	case 1 :
                     	    // 
D:\\Programming\\antlr\\tests\\test4.g:4:22: '0' .. '9'
                     	    {
                     	    matchRange('0','9');

                     	    }
                     	    break;

                     	default :
                     	    if ( cnt2 >= 1 ) break loop2;
                                 EarlyExitException eee =
                                     new EarlyExitException(2, 
input);
                                 throw eee;
                         }
                         cnt2++;
                     } while (true);
                     }
                     break;
             }

(And yes, the indentation is a bit messed up.  This is how it 
appeared in ANTLRworks though.)

Anyway, looking at this it's clear to see that it examines only 
one character of lookahead and basically decides that if there's a 
dot then the entire optional clause must be present -- despite the 
"minimal satisfying input" for that clause being a dot followed by 
at least one digit.  So its lookahead is clearly insufficient for 
the task.  (It would have been right if that were a * or ? instead 
of a +, though.)

Interestingly, it sometimes seems to get it right.  For example, 
in this test grammar:

lexer grammar test2;
NUMBER: ('+' | '-')? ('0'..'9')+ ('.' ('0'..'9')+)?;
OTHER: .;

... the mTokens function does up to two characters of lookahead 
(if the first characters is a digit, then it's a NUMBER, otherwise 
if the first character is a plus or minus, *and* the second 
character is a digit, then it's a NUMBER, otherwise it's an 
OTHER), which is perfectly correct.  But then *this* grammar:

lexer grammar test3;
ONE	: 'one';
TWO : 'two';
OTHER: .;

... is wrong again.  Here, it also tests up to two characters of 
lookahead, which means that the input "onf" will be wrongly 
interpreted as a ONE and will produce an error rather than 
generating three OTHER tokens.