[antlr-interest] Lexer bug? (with test cases!)

Wed Oct 24 03:05:51 PDT 2007

On Oct 24, 2007, at 5:52 PM, Gavin Lambert wrote:

> At 14:15 24/10/2007, Terence Parr wrote:
> >ANTLR does not automatically backtrack in the lexer like
> >lex and other automata based lexers do.  Backtracking a DFA is
> >required to match what you want.  ANTLR simply predicts which
> >rule will win and proceeds with an LL parse.
>
> As Loring pointed out, no backtracking is required -- just a two- 
> character lookahead instead of a 1-character lookahead.

It appears it's not backtracking because only 1 extra char is enough,  
but it cannot be statically determined if we assume any char can  
follow.  I think you could make the subrule more complicated and it  
would require 2 extra then 3 then 4 etc...

> This is because the clause ('.' DIGIT+)? must either match a dot  
> followed by at least one digit (minimum two characters) or match  
> nothing at all.  Seeing the dot and then assuming the following  
> character is a DIGIT without actually checking it first just isn't  
> right.

Isn't right in that it won't do what is natural.  I couldnt' make  
antlr do the right thing easily in this case.

> Similarly, if the construct was ('0' 'x' DIGIT+)? then it must  
> check a minimum of three characters before matching the clause as a  
> whole.  For ('.' DIGIT*)?, the minimum is only one character.

Yep.  Now make that 'x'+ and you have arbitrary lookahead.  Now make  
it a rule+ ref and you have non-regular arbitrary lookahead etc...

> For ('.' DIGIT+ FOO)?, it must have unbounded lookahead

ha!  you're ahead of me. yep.

> -- it must verify that there really is a FOO after all the DIGITs  
> before it can accept the clause.  This is the tricky case, but it  
> is the case that was supposed to be dealt with automatically by  
> ANTLR with the switch to LL(*) [as I understood it].

The issue is that there is an implied (ANYTOKEN)* and really wildcard  
following any rule.  I can look at it again sometime, but i am unsure  
how to make antlr see farther into the fog of "there are infinite  
unspecified char next".

> >correct by how LL lexers (unlike lex lexers) work.
>
> But LL(*) is supposed to look ahead as many characters as it needs  
> to in order to determine that it really is a match, isn't it?   
> Otherwise this is a step backwards from v2, where you could at  
> least define an LL(k) lookahead limit greater than 1.

yep, but the fog after a lexer rule defeats it.

> >Remember folks: you can't just list a bunch of lexer rules that
> >make sense to a human and expect ANTLR to "make it so".
>
> Why not?  This case seems fairly straightforward.

Not to me and I spent 4 years making LL(*) work. ;)

> Any time there's an optional block (whether ? or *), ANTLR should  
> use sufficient lookahead (where "sufficient" might mean  
> "unbounded", as in the example above)

Can you tell me how to compute the static LL lookahead?  If not, then  
trust that I know it's not possible or at least too much for my  
brain. ;)  It's the fog that causes the problem for LL(*).  I could  
backtrack automatically upon error like lex but it would be VERY slow  
in a recursive descent lexer.

Have you built a lex like creature before?  I have back in 1988.  I  
believe I'm remembering how it operates correctly.

Ter