[antlr-interest] Missing characters in partial matches
Gavin Lambert
antlr at mirality.co.nz
Sat Aug 23 04:04:52 PDT 2008
At 17:14 23/08/2008, Thomas Brandon wrote:
>Not quite. ANTLR is LL(*), it looks ahead as many *characters*
>as are needed not just 1, but not across token boundaries. So
>as long as the alternates are all a single token there is no
>need to merge rules. But if a sequence can be matched as
>either a single token or a sequence of multiple tokens you
>must merge them as ANTLR will not consider the possibility of
>multiple tokens matching the input.
True; and the longest-match rule helps here: in a language with a
'begin' keyword, the lexer will still generate an Identifier when
faced with 'beginning', because it can consume more input that way.
But it's surprising how often this sort of issue comes up, so it's
always something you need to keep in the back of your mind. At
least until this gets changed, anyway ;)
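
As a rough illustration of the longest-match idea described above, here
is a toy Python sketch. It is not ANTLR's generated code or its actual
prediction algorithm; the rule table, the rule names and the tokenize()
function are all hypothetical, made up for this example. Every rule is
tried at the current position and the longest match wins, with earlier
rules winning ties.

```python
import re

# Hypothetical rule table: a keyword rule listed before a general
# identifier rule. The names are assumptions for this sketch only.
RULES = [
    ("BEGIN", re.compile(r"begin")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("WS", re.compile(r"\s+")),
]


def tokenize(text):
    """Return (rule name, matched text) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' means an earlier rule wins when lengths tie,
            # so a bare 'begin' is BEGIN rather than IDENTIFIER.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # whitespace is skipped, not emitted
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(tokenize("begin beginning"))
# [('BEGIN', 'begin'), ('IDENTIFIER', 'beginning')]
```

Both rules match the first five characters of 'beginning', but the
identifier rule can keep going, so it wins.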
(When I say "the lexer can act like it's LL(1)", I don't really
mean that it's LL(1) all the time, just that you need to be aware
that it tries to use the minimum amount of lookahead that it
thinks it can get away with. That is often only a single
character, which is not always enough to disambiguate completely,
especially when loops and optional paths are involved.)
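
To make the lookahead point concrete, here is another toy Python
sketch, again not ANTLR's real algorithm. The INT / FLOAT split, the
'1..5' input and the lex_number() function are assumptions chosen for
illustration; they do not come from this thread. With one character of
lookahead the lexer sees '.', commits to the FLOAT path and then fails;
with two characters it can tell that no digit follows the '.' and
settles for INT.

```python
def lex_number(text, pos, lookahead):
    """Lex an INT or FLOAT starting at pos.

    'lookahead' (1 or 2) is how many characters are inspected before
    committing to the optional fractional part. Returns a
    ((token type, token text), next position) pair.
    """
    if lookahead not in (1, 2):
        raise ValueError("this sketch only models lookahead of 1 or 2")

    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if pos == start:
        raise ValueError(f"expected a digit at position {start}")

    window = text[pos:pos + lookahead]
    if lookahead == 1:
        take_fraction = window == "."
    else:
        take_fraction = (
            len(window) == 2 and window[0] == "." and window[1].isdigit()
        )

    if not take_fraction:
        return ("INT", text[start:pos]), pos

    pos += 1  # consume the '.'; from here on we are committed to FLOAT
    frac_start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if pos == frac_start:
        # Only reachable with lookahead == 1: the '.' was consumed
        # before we knew whether a digit followed it.
        raise ValueError(f"committed to FLOAT but no digit at {pos}")
    return ("FLOAT", text[start:pos]), pos


print(lex_number("1.5", 0, 1))   # (('FLOAT', '1.5'), 3)
print(lex_number("1..5", 0, 2))  # (('INT', '1'), 1)
# lex_number("1..5", 0, 1) raises ValueError after consuming "1."
```

The one-character version has already eaten the '.' by the time it
notices the problem, which is the kind of partial match the subject
line is about.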