[antlr-interest] Lexer bug? (with test cases!)
Terence Parr
parrt at cs.usfca.edu
Wed Oct 24 03:05:51 PDT 2007
On Oct 24, 2007, at 5:52 PM, Gavin Lambert wrote:
> At 14:15 24/10/2007, Terence Parr wrote:
> >ANTLR does not automatically backtrack in the lexer like
> >lex and other automata based lexers do. Backtracking a DFA is
> >required to match what you want. ANTLR simply predicts which
> >rule will win and proceeds with an LL parse.
>
> As Loring pointed out, no backtracking is required -- just a two-
> character lookahead instead of a 1-character lookahead.
It appears it's not backtracking because only 1 extra char is enough,
but it cannot be statically determined if we assume any char can
follow. I think you could make the subrule more complicated and it
would require 2 extra then 3 then 4 etc...
> This is because the clause ('.' DIGIT+)? must either match a dot
> followed by at least one digit (minimum two characters) or match
> nothing at all. Seeing the dot and then assuming the following
> character is a DIGIT without actually checking it first just isn't
> right.
Isn't right in that it won't do what is natural. I couldnt' make
antlr do the right thing easily in this case.
> Similarly, if the construct was ('0' 'x' DIGIT+)? then it must
> check a minimum of three characters before matching the clause as a
> whole. For ('.' DIGIT*)?, the minimum is only one character.
Yep. Now make that 'x'+ and you have arbitrary lookahead. Now make
it a rule+ ref and you have non-regular arbitrary lookahead etc...
> For ('.' DIGIT+ FOO)?, it must have unbounded lookahead
ha! you're ahead of me. yep.
> -- it must verify that there really is a FOO after all the DIGITs
> before it can accept the clause. This is the tricky case, but it
> is the case that was supposed to be dealt with automatically by
> ANTLR with the switch to LL(*) [as I understood it].
The issue is that there is an implied (ANYTOKEN)* and really wildcard
following any rule. I can look at it again sometime, but i am unsure
how to make antlr see farther into the fog of "there are infinite
unspecified char next".
> >correct by how LL lexers (unlike lex lexers) work.
>
> But LL(*) is supposed to look ahead as many characters as it needs
> to in order to determine that it really is a match, isn't it?
> Otherwise this is a step backwards from v2, where you could at
> least define an LL(k) lookahead limit greater than 1.
yep, but the fog after a lexer rule defeats it.
> >Remember folks: you can't just list a bunch of lexer rules that
> >make sense to a human and expect ANTLR to "make it so".
>
> Why not? This case seems fairly straightforward.
Not to me and I spent 4 years making LL(*) work. ;)
> Any time there's an optional block (whether ? or *), ANTLR should
> use sufficient lookahead (where "sufficient" might mean
> "unbounded", as in the example above)
Can you tell me how to compute the static LL lookahead? If not, then
trust that I know it's not possible or at least too much for my
brain. ;) It's the fog that causes the problem for LL(*). I could
backtrack automatically upon error like lex but it would be VERY slow
in a recursive descent lexer.
Have you built a lex like creature before? I have back in 1988. I
believe I'm remembering how it operates correctly.
Ter
More information about the antlr-interest
mailing list