[antlr-interest] solution to lexer issue

Fri Oct 26 19:52:46 PDT 2007

At 14:15 27/10/2007, Terence Parr wrote:
 >You only predict on valid input; I think errors are left to
 >"if ! valid prediction"; i.e., the opposite of "good" must
 >be "bad". No need to look for "bad" explicitly.

Yep; only once you've been through the whole list of tokens and 
can't find anything to generate do you concede defeat, report an 
error, and then maybe try error-recovery mechanisms like skipping 
a character and trying again.
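
To make that loop concrete, here is a minimal sketch in Python. It is not how ANTLR generates lexers; the rule list and the name `tokenize` are made up for illustration, and recovery is just the "skip one character" strategy mentioned above.

```python
import re

# Hypothetical token rules, for illustration only; not from any real grammar.
TOKEN_RULES = [
    ("INT", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[a-z]+")),
    ("WS", re.compile(r"[ \t\n]+")),
]

def tokenize(text):
    """Return (tokens, errors); errors holds positions where no rule matched."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # Went through the whole list and nothing matched: concede
            # defeat, record the error, then recover by skipping one
            # character and trying again.
            errors.append(pos)
            pos += 1
    return tokens, errors
```

Note that nothing in the rule list describes "bad" input; the error case is simply what is left when every rule has failed.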

 >I think a proof would abstract the issue to:
 >
 >r : x .* | .* ;
 >
 >where x is any alternative in any grammar.  Static analysis
 >involves using more and more lookahead until you can
 >distinguish between alts.  x and .* look identical by
 >definition so you look past x hoping to distinguish.
 >Oops: .* follows.  At this point, you concede defeat.
 >The only solution is to predict with the entire x; i.e., you
 >must backtrack like lex does or use Idle Jim's syn pred
 >solution ;)

Ah, I see what you meant now.
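
For anyone following along, "predict with the entire x" might be sketched like this in Python. Everything here is an assumption for illustration: `match_x` stands in for an arbitrary alternative x (here just the literal "begin"), and the alternatives are tried in order, the way a syntactic predicate would. It is not ANTLR's actual prediction code.

```python
def match_x(text, pos):
    """Hypothetical stand-in for alternative x: matches the literal 'begin'.

    Returns the end position on success, or None on failure.
    """
    literal = "begin"
    return pos + len(literal) if text.startswith(literal, pos) else None

def predict_r(text, pos=0):
    """Pick an alternative of  r : x .* | .* ;  by speculatively matching x.

    Returns (alternative number, position from which .* would consume).
    """
    mark = pos                  # remember where we started
    end = match_x(text, pos)    # try the whole of x, not a fixed k characters
    if end is not None:
        return 1, end           # alt 1: x matched; .* continues from `end`
    return 2, mark              # rewind to the mark and fall back to alt 2
```

With input "begin rest" this picks alt 1 and resumes after "begin"; with "begix" it rewinds to the start and picks alt 2. No fixed amount of lookahead does the same job, since every prefix of x is also a valid start for the bare .* alternative.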

 >Now, had I done this sketch of a proof earlier I would
 >realize that assuming .* follows is overly general; well,
 >actually, it's just wrong.

Given that alts *don't* have inherent precedence (or at least I 
don't think they do), would it behave better if you tweaked it to 
look more like this:

r : x .* | ~x .* ;

...?  Not saying that's right either (because you still don't want 
to consume erroneous input until you have no other choice), but it 
might be easier to get to from what you've got at the moment.

 >I *hope* the change doesn't mess up stuff like keyword
 >vs ID.  Could involve a lot of thought / testing on my
 >part to verify I haven't screwed up something else.

Simple solution for that: write lots of unit tests :)  (Especially 
lexer-only tests.  I've noticed a lot of the unit test snippets 
you've posted in the past have tended to be combined tests, even 
when using a trivial parser.  Possibly that was just coincidental.)
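
As a rough idea of what I mean by lexer-only tests, here is a sketch using Python's unittest. The `lex` function is a hypothetical toy lexer, included only so the tests have something to run against; a real suite would feed strings to the generated lexer and compare the token stream, with no parser involved.

```python
import re
import unittest

# Hypothetical toy lexer, used only so the tests below are runnable.
KEYWORDS = {"if", "while"}
TOKEN_RE = re.compile(r"(?P<ID>[a-z]+)|(?P<INT>[0-9]+)|(?P<WS>\s+)")

def lex(text):
    """Return a list of (type, text) pairs, dropping whitespace."""
    out, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no viable token at index {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "ID" and value in KEYWORDS:
            kind = "KEYWORD"    # keyword vs ID is decided on the whole match
        if kind != "WS":
            out.append((kind, value))
        pos = m.end()
    return out

class LexerOnlyTests(unittest.TestCase):
    def test_keyword(self):
        self.assertEqual(lex("if"), [("KEYWORD", "if")])

    def test_keyword_prefix_is_id(self):
        self.assertEqual(lex("iffy"), [("ID", "iffy")])

    def test_bad_character_is_reported(self):
        with self.assertRaises(ValueError):
            lex("if $")
```

Saved as a module, this can be run with `python -m unittest`. The keyword-prefix case is the sort of thing I'd want pinned down before and after any change to prediction.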