[antlr-interest] Missing characters in partial matches

Sat Aug 23 04:04:52 PDT 2008

At 17:14 23/08/2008, Thomas Brandon wrote:
 >Not quite. ANTLR is LL(*), it looks ahead as many *characters*
 >as are needed not just 1, but not across token boundaries. So
 >as long as the alternates are all a single token there is no
 >need to merge rules.  But if a sequence can be matched as
 >either a single token or a sequence of multiple tokens you
 >must merge them as ANTLR will not consider the possibility of
 >multiple tokens matching the input.

True, and the longest-match rule helps here: in a language with a
'begin' keyword, the lexer will still generate an Identifier when
faced with 'beginning', because it can consume more input that way.
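
To make the longest-match point concrete, here is a rough sketch in
Python. It is a toy model, not what ANTLR generates; the rule names
BEGIN and IDENTIFIER and the helper next_token are made up purely
for illustration.

```python
import re

# Hypothetical token rules: a 'begin' keyword and a generic identifier.
RULES = [
    ("BEGIN", re.compile(r"begin")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
]

def next_token(text, pos=0):
    """Return (type, lexeme) for the rule that consumes the most input at pos."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # A strictly longer match wins; on a tie the earlier rule (the keyword) is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("begin"))      # the keyword rule matches all five characters
print(next_token("beginning"))  # the identifier rule consumes more input
```

In this sketch 'begin' comes out as the keyword only because ties go
to the rule listed first; 'beginning' comes out as an identifier
because that rule matches more characters.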

But it's surprising how often this sort of issue comes up, so it's 
always something you need to keep in the back of your mind.  At 
least until this gets changed, anyway ;)

(When I say "the lexer can act like it's LL(1)", I don't mean
that it's LL(1) all the time, just that it tries to use the
minimum amount of lookahead it thinks it can get away with. That
is often only a single character, which is not always enough to
disambiguate completely, especially when loops and optional paths
are involved.)
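
As an illustration of how committing on a single character of
lookahead can go wrong, here is a hand-written toy in Python. It is
only a model under my own assumptions: the INT and FLOAT token names
and the lex_number helper are hypothetical, and this is not the code
ANTLR actually emits.

```python
def lex_number(text, pos=0):
    """Toy model: choose between INT and FLOAT using one character of lookahead."""
    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if pos == start:
        raise ValueError("expected a digit")
    # One character of lookahead: a '.' is taken to mean FLOAT.
    if pos < len(text) and text[pos] == ".":
        pos += 1  # the '.' is consumed, so the decision is now committed
        frac_start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if pos == frac_start:
            # Partial match: the digits and the '.' are already consumed,
            # and this toy has no way to back out and emit INT instead.
            raise ValueError(
                f"no digit after '.' at offset {pos}; "
                f"{text[start:pos]!r} already consumed"
            )
        return ("FLOAT", text[start:pos])
    return ("INT", text[start:pos])

print(lex_number("42"))   # no '.', so INT
print(lex_number("1.5"))  # '.' followed by a digit, so FLOAT
try:
    lex_number("1..2")    # '.' followed by another '.'
except ValueError as err:
    print("error:", err)
```

With input like '1..2' the toy sees the first '.', commits to FLOAT,
and then fails, even though an INT followed by some other token
starting with '.' would have been a valid reading. Looking one more
character ahead before committing would avoid it.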