[antlr-interest] Missing characters in partial matches

Fri Aug 22 22:14:30 PDT 2008

On Sat, Aug 23, 2008 at 1:09 PM, Gavin Lambert <antlr at mirality.co.nz> wrote:
> At 13:11 23/08/2008, Matt Palmer wrote:
>>At the heart of my problem, I guess I'm not sure why, when the
>>start comment didn't match, the lexer didn't proceed to match a
>>Lsqb, followed by Text.  I can make it parse the text as given
>>(albeit awkwardly) by specifying all the intermediate prefixes as
>>other tokens, using this grammar:
> [...]
>>Comment :    '[!--'  (options {greedy=false;} : . )* '--]' ;
>>NotCom1 :    '[!-' ;
>>NotCom2 :    '[!';
>>Lsqb    :    '[' ;
>>Text    :    (~Lsqb)+ ;
>
> This is my pet peeve with the way that the v3 lexer
> operates.  (It's apparently mostly by design, but I think that
> it's an unhelpful design.  Ter has agreed to look into it at some
> point.)
>
> Now, I'm not really an expert in the internal workings of parsers,
> but as I understand it what's going on internally is that ANTLR
> builds a set of minimal lookahead to disambiguate between multiple
> tokens, *and* (and this is the bit that causes the odd behaviour)
> assumes that the tokens need not be contiguous -- it's allowed to
> have stray characters outside of any formal token, which it simply
> ignores.
>
> So when you've only got your original rules, ANTLR builds up an
> internal model that says something like this:
>
>   Ok, so the next character is a '['.
>   That means it can either be a Comment, a Lsqb, or a Text.
>   For it to be a Comment, the one after that would be a '!'.
>   For it to be a Lsqb, the one after that could be anything.
>   For it to be a Text, the one after that could be anything but a
> '['.
>   The following character is a '!'.
>   Score!  That looks like a Comment, I'll go with that.
>     (Comment wins both because it's more specific and because
> it's listed first, I think.)
>   Sweet, now we're certain it's a Comment, so the next character
> must be a '-'.
>   Wait, it's not.  Ok, so that's invalid input, ignore that and
> let's move on to something I can figure out.
>
> [This all might be completely wrong, of course, but it fits with
> what I've observed thus far.]
>
> With the rules you posted above, ANTLR has more tokens to choose
> between and this forces it to look ahead further before
> "committing" itself to a particular token.
That looks right. Examining the tokens DFA either through the function
for that in ANTLRWorks or by looking at the generated nextTokens rule
can be illuminating here.
>
> The general rule of thumb I tend to use is that most of the time
> the lexer seems to behave like it's just LL(1), so any tokens that
> have the same left edge need to be merged and given explicit
> "escape clauses" so they can do the right thing when they
> encounter something unexpected.
Not quite. ANTLR is LL(*), it looks ahead as many *characters* as are
needed not just 1, but not across token boundaries. So as long as the
alternates are all a single token there is no need to merge rules. But
if a sequence can be matched as either a single token or a sequence of
multiple tokens you must merge them as ANTLR will not consider the
possibility of multiple tokens matching the input. Apart from this I
don't think there is any advantage to merging except for particularly
nasty infinite sequences where you may end up with hideously
complicated predictors. Otherwise in the case of single tokens I think
merging will only hurt performance as the prediction must be
duplicated in nextToken and the actual token method.

Tom.
>
> Essentially the same as what Jim posted earlier, except that I
> think he forgot some of the punctuation; also, I prefer to
> explicitly write the content possibilities instead of using the
> '.':
>
> fragment Lsqb: '[';
>
> Comment
>   : '['
>     ( ('!--') => '!--' (~'-' | '-' (~'-' | '-' ~']'))* '--]'
>     | { $type = Lsqb; }
>     )
>   ;