[antlr-interest] Missing characters in partial matches
Gavin Lambert
antlr at mirality.co.nz
Fri Aug 22 20:09:37 PDT 2008
At 13:11 23/08/2008, Matt Palmer wrote:
>At the heart of my problem, I guess I'm not sure why, when the
>start comment didn't match, the lexer didn't proceed to match a
>Lsqb, followed by Text. I can make it parse the text as given
>(albeit awkwardly) by specifying all the intermediate prefixes as
>other tokens, using this grammar:
[...]
>Comment : '[!--' (options {greedy=false;} : . )* '--]' ;
>NotCom1 : '[!-' ;
>NotCom2 : '[!';
>Lsqb : '[' ;
>Text : (~Lsqb)+ ;
This is my pet peeve with the way that the v3 lexer
operates. (It's apparently mostly by design, but I think that
it's an unhelpful design. Ter has agreed to look into it at some
point.)
Now, I'm not really an expert in the internal workings of parsers,
but as I understand it what's going on internally is that ANTLR
works out the minimal lookahead needed to disambiguate between
multiple tokens, *and* (and this is the bit that causes the odd
behaviour) assumes that the tokens need not be contiguous -- it's
allowed to have stray characters outside of any formal token, which
it simply ignores.
So when you've only got your original rules, ANTLR builds up an
internal model that says something like this:
    Ok, so the next character is a '['.
    That means it can either be a Comment, a Lsqb, or a Text.
      For it to be a Comment, the one after that would be a '!'.
      For it to be a Lsqb, the one after that could be anything.
      For it to be a Text, the one after that could be anything
      but a '['.
    The following character is a '!'.
    Score! That looks like a Comment, I'll go with that.
      (Comment wins both because it's more specific and because
      it's listed first, I think.)
    Sweet, now we're certain it's a Comment, so the next character
    must be a '-'.
    Wait, it's not. Ok, so that's invalid input, ignore that and
    let's move on to something I can figure out.
[This all might be completely wrong, of course, but it fits with
what I've observed thus far.]
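
To make that concrete, here is a rough toy model of the behaviour in Python. It is only a sketch of the guess above, not ANTLR's real prediction algorithm: the function name naive_lex and the choice to drop exactly the two characters '[!' on a failed Comment are assumptions made for illustration.

```python
def naive_lex(text):
    """Toy model of the early-commit behaviour guessed at above.

    Hypothetical helper, not ANTLR code. Returns (tokens, dropped), where
    dropped lists the characters that ended up in no token at all.
    """
    tokens, dropped = [], []
    i = 0
    while i < len(text):
        if text.startswith("[!", i):
            # Seeing '[!' is enough to pick Comment over Lsqb, so commit.
            if text.startswith("[!--", i):
                end = text.find("--]", i + 4)
            else:
                end = -1
            if end != -1:
                tokens.append(("Comment", text[i:end + 3]))
                i = end + 3
            else:
                # Committed to Comment but the rest did not match: the
                # prefix is treated as invalid input and skipped (assumed).
                dropped.append(text[i:i + 2])
                i += 2
        elif text[i] == "[":
            tokens.append(("Lsqb", "["))
            i += 1
        else:
            # Text: a run of anything that is not '['.
            j = i
            while j < len(text) and text[j] != "[":
                j += 1
            tokens.append(("Text", text[i:j]))
            i = j
    return tokens, dropped
```

In this model the input a[!b comes out as two Text tokens with '[!' belonging to neither, which is the kind of missing-characters symptom being discussed.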
With the rules you posted above, ANTLR has more tokens to choose
between and this forces it to look ahead further before
"committing" itself to a particular token.
The general rule of thumb I tend to use is that most of the time
the lexer seems to behave like it's just LL(1), so any tokens that
have the same left edge need to be merged and given explicit
"escape clauses" so they can do the right thing when they
encounter something unexpected.
For your grammar that comes out essentially the same as what Jim
posted earlier, except that I think he forgot some of the
punctuation; also, I prefer to explicitly write the content
possibilities instead of using the '.':
fragment Lsqb: '[';
Comment
    : '['
      ( ('!--') => '!--' (~'-' | '-' (~'-' | '-' ~']'))* '--]'
      | { $type = Lsqb; }
      )
    ;
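
For comparison, the shape of the merged rule can be sketched as a hand-written lexer in Python. This is a loose analogue and not generated ANTLR code: the name merged_lex is made up, the comment body is located with a plain search for '--]' rather than the explicit alternatives in the grammar, and raising ValueError on an unterminated comment is an assumption about what "committed" should mean here.

```python
def merged_lex(text):
    """Hand-written analogue of the merged Comment/Lsqb rule (a sketch).

    Hypothetical helper: every '[' enters one rule, which checks the whole
    '!--' prefix before committing and otherwise falls back to Lsqb.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == "[":
            if text.startswith("!--", i + 1):
                # Plays the role of the ('!--') => predicate: only now commit.
                end = text.find("--]", i + 4)
                if end == -1:
                    raise ValueError("unterminated comment at offset %d" % i)
                tokens.append(("Comment", text[i:end + 3]))
                i = end + 3
            else:
                # The escape clause: emit just the '[' as an Lsqb token and
                # leave the following characters for the next token.
                tokens.append(("Lsqb", "["))
                i += 1
        else:
            # Text: a run of anything that is not '['.
            j = i
            while j < len(text) and text[j] != "[":
                j += 1
            tokens.append(("Text", text[i:j]))
            i = j
    return tokens
```

With this shape a[!b yields Text 'a', Lsqb '[', Text '!b', so no input characters fall outside a token.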