[antlr-interest] solution to lexer issue
Terence Parr
parrt at cs.usfca.edu
Fri Oct 26 18:15:09 PDT 2007
On Oct 25, 2007, at 6:40 PM, Gavin Lambert wrote:
> At 15:30 25/10/2007, Terence Parr wrote:
> >Solution is to change my assumption that any char can follow a
> >token (some of you don't believe me that is the problem but it is).
>
> I'm curious, isn't this "any char can follow a token" thing only
> true for (a) filter=true or (b) malformed input? Neither of which
> ought to be the common case?
Well, there was an error in my thinking here. That's why it's taken
4 years to get "right". Damn real world invalidates a number of my
assumptions and I have to alter the algorithm to suit. ;) Principle:
You only predict on valid input; I think errors are left to "if !
valid prediction"; i.e., the opposite of "good" must be "bad". No
need to look for "bad" explicitly.
> And I'm still not sure how assuming that any character at all could
> follow the "e" in "one" means that you don't need to test that the
> "e" is actually there at all. But I'll take your word for it.
I think a proof would abstract the issue to:
r : x .* | .* ;
where x is any alternative in any grammar. Static analysis involves
using more and more lookahead until you can distinguish between
alts. x and .* look identical by definition so you look past x
hoping to distinguish. Ooops. .* follows. At this point, you
concede defeat. Only solution is to predict with entire x; i.e., you
must backtrack like lex does or using Idle Jim's syn pred solution ;)
Now, had I done this sketch of a proof earlier I would realize that
assuming .* follows is overly general; well, actually, it's just
wrong. Using LL(*) rather than a DFA to do lexing is something ANTLR
along with other folks (scannerless parsing people) is pushing the
envelop on. I *hope* the change doesn't mess up stuff like keyword
vs ID. Could involve a lot of thought / testing on my part to verify
I haven't screwed up something else.
> >NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
> >DOT : '.' ;
> >
> >NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
> >OTHER: .;
> >
> >ONE: 'one';
> >TWO: 'two';
> >OTHER: .;
> [...]
> >Note that all three examples are ambiguous. Same input,
> >different rules can match.
>
> If all rules have equal precedence, then sure, they're ambiguous.
> But I thought the lexer was supposed to have defined precedence
> (longest match and/or first listed token wins)..?
Yep, hence you won't get an error between token rules ;)
Ter
More information about the antlr-interest
mailing list