[antlr-interest] solution to lexer issue

Fri Oct 26 18:15:09 PDT 2007

On Oct 25, 2007, at 6:40 PM, Gavin Lambert wrote:

> At 15:30 25/10/2007, Terence Parr wrote:
> >Solution is to change my assumption that any char can follow a
> >token (some of you don't believe me that is the problem but it is).
>
> I'm curious, isn't this "any char can follow a token" thing only  
> true for (a) filter=true or (b) malformed input?  Neither of which  
> ought to be the common case?

Well, there was an error in my thinking here.  That's why it's taken  
4 years to get "right".  Damn real world invalidates a number of my  
assumptions and I have to alter the algorithm to suit. ;)  Principle:  
You only predict on valid input; I think errors are left to "if ! 
valid prediction"; i.e., the opposite of "good" must be "bad". No  
need to look for "bad" explicitly.

> And I'm still not sure how assuming that any character at all could  
> follow the "e" in "one" means that you don't need to test that the  
> "e" is actually there at all.  But I'll take your word for it.

I think a proof would abstract the issue to:

r : x .* | .* ;

where x is any alternative in any grammar.  Static analysis involves  
using more and more lookahead until you can distinguish between  
alts.  x and .* look identical by definition so you look past x  
hoping to distinguish.  Ooops. .* follows.  At this point, you  
concede defeat.  Only solution is to predict with entire x; i.e., you  
must backtrack like lex does or using Idle Jim's syn pred solution ;)

Now, had I done this sketch of a proof earlier I would realize that  
assuming .* follows is overly general; well, actually, it's just  
wrong.  Using LL(*) rather than a DFA to do lexing is something ANTLR  
along with other folks (scannerless parsing people) is pushing the  
envelop on.  I *hope* the change doesn't mess up stuff like keyword  
vs ID.  Could involve a lot of thought / testing on my part to verify  
I haven't screwed up something else.

> >NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
> >DOT : '.' ;
> >
> >NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)?;
> >OTHER: .;
> >
> >ONE: 'one';
> >TWO: 'two';
> >OTHER: .;
> [...]
> >Note that all three examples are ambiguous. Same input,
> >different rules can match.
>
> If all rules have equal precedence, then sure, they're ambiguous.   
> But I thought the lexer was supposed to have defined precedence  
> (longest match and/or first listed token wins)..?

Yep, hence you won't get an error between token rules ;)

Ter