[antlr-interest] Lexer bug? (with test cases!)

Wed Oct 24 04:43:55 PDT 2007

Terence Parr wrote:
>
> It appears it's not backtracking because only 1 extra char is enough, 
> but it cannot be statically determined if we assume any char can 
> follow.  I think you could make the subrule more complicated and it 
> would require 2 extra then 3 then 4 etc...

I think that inserting a catch{} block for optional subrules would go a 
long way toward eliminating this particular aspect of the problem. When 
you don't match correctly, throw an exception, catch it, and reset. The 
consumption of the input becomes temporary until the token completes, 
but that's straightforward.

>> This is because the clause ('.' DIGIT+)? must either match a dot 
>> followed by at least one digit (minimum two characters) or match 
>> nothing at all.  Seeing the dot and then assuming the following 
>> character is a DIGIT without actually checking it first just isn't 
>> right.
>
> Isn't right in that it won't do what is natural.  I couldnt' make 
> antlr do the right thing easily in this case.

Freudian slip? ;)

>> Similarly, if the construct was ('0' 'x' DIGIT+)? then it must check 
>> a minimum of three characters before matching the clause as a whole.  
>> For ('.' DIGIT*)?, the minimum is only one character.
>
> Yep.  Now make that 'x'+ and you have arbitrary lookahead.  Now make 
> it a rule+ ref and you have non-regular arbitrary lookahead etc...
>
>> For ('.' DIGIT+ FOO)?, it must have unbounded lookahead
>
> ha!  you're ahead of me. yep.

catch{}
>> >correct by how LL lexers (unlike lex lexers) work.
>>
>> But LL(*) is supposed to look ahead as many characters as it needs to 
>> in order to determine that it really is a match, isn't it?  Otherwise 
>> this is a step backwards from v2, where you could at least define an 
>> LL(k) lookahead limit greater than 1.
>
> yep, but the fog after a lexer rule defeats it.
>

Perhaps it is because I am a bear of very little brain, but I don't get 
this. My understanding of LL(*) is "Antlr will figure out the right 
value of k for this case." But k=1 is not always the right value of k.

>> >Remember folks: you can't just list a bunch of lexer rules that
>> >make sense to a human and expect ANTLR to "make it so".
>>
>> Why not?  This case seems fairly straightforward.
>
> Not to me and I spent 4 years making LL(*) work. ;)

The original question was regarding Number = DIGIT+ ('.' DIGIT+)? and 
how LL(*) was deciding that '.' implied the entire chain, when it 
didn't. Perhaps another year would help, but I don't see how this counts 
as working.

>> Any time there's an optional block (whether ? or *), ANTLR should use 
>> sufficient lookahead (where "sufficient" might mean "unbounded", as 
>> in the example above)
>
> Can you tell me how to compute the static LL lookahead?  If not, then 
> trust that I know it's not possible or at least too much for my brain. 
> ;)  It's the fog that causes the problem for LL(*).  I could backtrack 
> automatically upon error like lex but it would be VERY slow in a 
> recursive descent lexer.
Why would it be slow? And if it's slow, so what? Right now, you're 
making decisions which sometimes are right and sometimes are not right. 
If they are not right, you throw an exception and discard a bunch of the 
input and pick up again. If you were to throw an exception and NOT 
discard part of the input but terminate the token early, the only 
performance time difference seems like it would be the time spent 
recognizing the bits of the input you were going to throw away in the 
present-day case.

Besides, isn't that what backtrack=... is for? :-) Maybe there's a 
lexer::backtrack switch needed? Or maybe k needs to be * instead of 1.

> Have you built a lex like creature before?  I have back in 1988.  I 
> believe I'm remembering how it operates correctly.
>
I don't think it matters, does it? You're in the recursive-descent 
business, not the state-machine business. You sell chicken, they sell 
hamburgers.

=Austin