[antlr-interest] Lexer matching non-matching rule

Sat May 16 03:52:13 PDT 2009

On Sat, 2009-05-16 at 08:27 +0530, Indhu Bharathi wrote:
> This is because on seeing 'f' of foo lexer has two options - 1. IDENT
> 2. URL. And it takes the second options since that seems to be longer
> that the first alternative. Note that the lexer always tries to match
> the longest token possible. 
> 
> After having decided to go for URL, it matches the input with URL and
> it fails. Lexer doesn't backtrack and hence throws an exception.

I suppose the key here is "lexer doesn't backtrack". The following
grammar has no problem matching inputs "foo" and "foobar", even though
one is a prefix of the other:

===============================================
grammar Y;
options { output=AST; }

file: (FOO|BAR|FOOBAR)* EOF;

FOO:    'foo';
BAR:    'bar';
FOOBAR: 'foobar';
SPACE:  (' ' | '\n')+  { $channel = HIDDEN; };
===============================================

If I remove the FOOBAR token from this grammar, input "foobar" is
matched as two tokens FOO BAR. But if instead I change the FOOBAR token
by adding another letter like this:

FOOBAR: 'foobarz';

Then the lexer fails to match "foobar" as two tokens, complaining about
a missing 'z'. It can still match "foo" however.

So if I dare to extrapolate something from this, it looks like that
while one token may be the prefix of another (obviously, otherwise it
would be difficult to create lexers for most programming languages), it
must not be the case that a proper prefix of one token can be matched as
more than one token. Is that it?

I can understand the motivation of this restriction in the interest of
keeping the lexer target code at a certain complexity level, but I have
not seen it stated in the documentation. It would have been nice if the
lexer generator had issued a warning.

J'