[antlr-interest] Lexer errors when looking for wrong token

Mon Oct 11 04:57:19 PDT 2010

A Z wrote:

> I have a lexer with the following rules:
> 
> LBMINUSGT                  : '[->';
> LBASRB                     : '[*]';
> LBAST                      : '[*';
> LBEQUALS                   : '[=';
> LBPLUSRB                   : '[+]';
> LBRACE                     : '{';
> LBRACKET                   : '[';
> MINUS                      : '-';
> 
> The lexer fails (with an error message) when any string of '[-' or '[*' is
> detected. I'm confused why ANTLR cannot tokenize '[-' correctly as LBRACKET
> MINUS.

Because ANTLR lexers cannot backtrack.

'[-' is a prefix of the token LBMINUSGT and of no other token. Thus,
once '[' and '-' have arrived in the input, the lexer commits to
recognizing LBMINUSGT. If no '>' follows, the lexer cannot backtrack
to the point before the '-' arrived and emit '[' as LBRACKET instead.
Since no other token starts with '[-', an error is reported and error
recovery takes place.
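
To make the failure concrete, here is a toy Python model of a lexer
that commits to a token as soon as lookahead singles one out and never
reconsiders. It is only a sketch of the behaviour described above, not
ANTLR's actual prediction algorithm; the names predict, tokenize and
LexerError are made up for this illustration.

```python
# Toy model only: commit to a token once lookahead leaves a single
# candidate, then insist on matching it. No backtracking.
TOKENS = {
    "LBMINUSGT": "[->",
    "LBASRB": "[*]",
    "LBAST": "[*",
    "LBEQUALS": "[=",
    "LBPLUSRB": "[+]",
    "LBRACE": "{",
    "LBRACKET": "[",
    "MINUS": "-",
}


class LexerError(Exception):
    pass


def predict(text, pos, tokens):
    """Look ahead one character at a time until one candidate is left."""
    candidates = dict(tokens)
    depth = 0
    while len(candidates) > 1:
        ch = text[pos + depth] if pos + depth < len(text) else None
        longer = {name: lit for name, lit in candidates.items()
                  if len(lit) > depth and lit[depth] == ch}
        if not longer:
            # No literal continues with ch: keep those that end right here.
            candidates = {name: lit for name, lit in candidates.items()
                          if len(lit) == depth}
            break
        candidates = longer
        depth += 1
    if len(candidates) != 1:
        raise LexerError(f"no token starts at position {pos}")
    return next(iter(candidates))


def tokenize(text, tokens=TOKENS):
    pos, out = 0, []
    while pos < len(text):
        name = predict(text, pos, tokens)
        literal = tokens[name]
        # Committed: the whole literal must match now, there is no way back.
        if not text.startswith(literal, pos):
            raise LexerError(f"expected {literal!r} at position {pos}")
        out.append(name)
        pos += len(literal)
    return out


print(tokenize("[->"))  # ['LBMINUSGT']
try:
    tokenize("[-")
except LexerError as err:
    print("error:", err)  # error: expected '[->' at position 0
```

After '[' and '-' the only remaining candidate is LBMINUSGT, so the
model commits to it and then fails on the missing '>'.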

The canonical way to solve this problem is to create tokens that
cover all prefixes of all existing tokens. For the grammar fragment
you cited, that means additional tokens that match '[-' and '[+'.
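
Which prefixes are missing can be checked mechanically. The following
Python sketch lists every proper prefix of a token literal that is not
itself a token literal; the function name missing_prefixes is made up
for this illustration, and the literals are the ones from the quoted
fragment.

```python
def missing_prefixes(literals):
    """Return proper prefixes of the literals that are not literals themselves."""
    known = set(literals)
    missing = set()
    for lit in literals:
        # Proper, non-empty prefixes: lit[:1] up to lit[:len(lit) - 1].
        for end in range(1, len(lit)):
            prefix = lit[:end]
            if prefix not in known:
                missing.add(prefix)
    return sorted(missing)


# Token literals from the quoted grammar fragment.
LITERALS = ["[->", "[*]", "[*", "[=", "[+]", "{", "[", "-"]

print(missing_prefixes(LITERALS))  # ['[+', '[-']
```

'[*' does not show up because LBAST already covers it.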

I hope this makes the problem more understandable,

	Joachim

PS: Actually, there is a non-canonical way to solve the problem:
One can use a different tool to generate the lexer, one that can
backtrack, and use ANTLR only for its great parser abilities.
That's what I do: I use JFlex, after having fought with ANTLR lexer
definition restrictions once too often. ;-)
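
For comparison, here is a minimal Python sketch of the longest-match
rule that lex-style scanners follow: at each position, take the
longest token that actually matches. This is not how JFlex is
implemented (it generates a DFA); trying every literal by brute force
merely gives the same result for fixed strings. The function name
tokenize_longest is made up for this illustration.

```python
# Token literals from the quoted grammar fragment.
TOKENS = {
    "LBMINUSGT": "[->",
    "LBASRB": "[*]",
    "LBAST": "[*",
    "LBEQUALS": "[=",
    "LBPLUSRB": "[+]",
    "LBRACE": "{",
    "LBRACKET": "[",
    "MINUS": "-",
}


def tokenize_longest(text, tokens=TOKENS):
    pos, out = 0, []
    while pos < len(text):
        # Every token whose literal matches in full at this position.
        matches = [(len(lit), name) for name, lit in tokens.items()
                   if text.startswith(lit, pos)]
        if not matches:
            raise ValueError(f"no token matches at position {pos}")
        # Longest complete match wins; shorter matches are the fallback.
        length, name = max(matches, key=lambda m: m[0])
        out.append(name)
        pos += length
    return out


print(tokenize_longest("[-"))   # ['LBRACKET', 'MINUS']
print(tokenize_longest("[->"))  # ['LBMINUSGT']
```

Here '[-' comes out as LBRACKET MINUS without any extra prefix tokens,
because only complete matches are ever considered.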

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Joachim Schrod				Email: jschrod at acm.org
Roedermark, Germany