[antlr-interest] Missing characters in partial matches

Fri Aug 22 20:09:37 PDT 2008

At 13:11 23/08/2008, Matt Palmer wrote:
>At the heart of my problem, I guess I'm not sure why, when the 
>start comment didn't match, the lexer didn't proceed to match a 
>Lsqb, followed by Text.  I can make it parse the text as given 
>(albeit awkwardly) by specifying all the intermediate prefixes as 
>other tokens, using this grammar:
[...]
>Comment :    '[!--'  (options {greedy=false;} : . )* '--]' ;
>NotCom1 :    '[!-' ;
>NotCom2 :    '[!';
>Lsqb    :    '[' ;
>Text    :    (~Lsqb)+ ;

This is my pet peeve with the way that the v3 lexer 
operates.  (It's apparently mostly by design, but I think that 
it's an unhelpful design.  Ter has agreed to look into it at some 
point.)

Now, I'm not really an expert in the internal workings of parsers, 
but as I understand it what's going on internally is that ANTLR 
builds a set of minimal lookahead to disambiguate between multiple 
tokens, *and* (and this is the bit that causes the odd behaviour) 
assumes that the tokens need not be contiguous -- it's allowed to 
have stray characters outside of any formal token, which it simply 
ignores.

So when you've only got your original rules, ANTLR builds up an 
internal model that says something like this:

   Ok, so the next character is a '['.
   That means it can either be a Comment, a Lsqb, or a Text.
   For it to be a Comment, the one after that would be a '!'.
   For it to be a Lsqb, the one after that could be anything.
   For it to be a Text, the one after that could be anything but a 
'['.
   The following character is a '!'.
   Score!  That looks like a Comment, I'll go with that.
     (Comment wins both because it's more specific and because 
it's listed first, I think.)
   Sweet, now we're certain it's a Comment, so the next character 
must be a '-'.
   Wait, it's not.  Ok, so that's invalid input, ignore that and 
let's move on to something I can figure out.

[This all might be completely wrong, of course, but it fits with 
what I've observed thus far.]

With the rules you posted above, ANTLR has more tokens to choose 
between and this forces it to look ahead further before 
"committing" itself to a particular token.

The general rule of thumb I tend to use is that most of the time 
the lexer seems to behave like it's just LL(1), so any tokens that 
have the same left edge need to be merged and given explicit 
"escape clauses" so they can do the right thing when they 
encounter something unexpected.

Essentially the same as what Jim posted earlier, except that I 
think he forgot some of the punctuation; also, I prefer to 
explicitly write the content possibilities instead of using the 
'.':

fragment Lsqb: '[';

Comment
   : '['
     ( ('!--') => '!--' (~'-' | '-' (~'-' | '-' ~']'))* '--]'
     | { $type = Lsqb; }
     )
   ;