[antlr-interest] Antlr 3 Lexer problem

Wed Jun 27 01:05:13 PDT 2007

On 6/27/07, Gavin Lambert <antlr at mirality.co.nz> wrote:
> At 09:06 27/06/2007, Geoffrey Zhu wrote:
>  >The syntactic predicate does not seem to work. The lexer chokes on
>  >exactly the same location 'c' if I pass in "( security".
>  >
>  >In mTokens() it still looks ahead only one step to determine what
>  >should be the next token.
>
> I think this is another occurrence of the case that Ter claims is
> by design, but that I and a few others would like to see changed:
> the lexer doesn't do backtracking, it simply fails with
> NoViableAltExceptions (or the equivalent) -- even when the parent
> grammar does do backtracking.  Basically once it enters a
> particular token it's going to either match that token or cause an
> error; it won't go back and pick a different token.
>
I think Ter's argument was that the LL(*) algorithm used in the lexer
is more powerful than backtracking. However, this seems to be a case
where the LL(*) algorithm falls over.

If you look at the generated code you can see there is an mTokens rule
with a comment "// T.g:1:10: ( ID | LPAREN | LP_SELECT )". So ANTLR is
effectively generating a lexer for the grammar:

MTOKENS
	:	ID | LPAREN | LP_SELECT
	;

fragment
ID : ('a'..'z')+;

fragment
LPAREN : '(';

fragment
LP_SELECT : LPAREN 'select';

For this grammar, ANTLR generates a correct lexer. MTOKENS can only
return one of ID, LPAREN and LP_SELECT, so once it has seen the '('
ANTLR only needs to look at the next character to decide which rule to
follow. Given an 's', MTOKENS must match LP_SELECT or give an error,
as matching LPAREN followed by a second token (an ID here) is not an
option.
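
To make that concrete, here is a rough Python sketch of the decision as
I read it. It is a toy model, not ANTLR's generated code: the function
name next_token_one_char and the NoViableAlt class are made up for
illustration, and I use "(security" without the space to keep
whitespace out of the picture.

```python
class NoViableAlt(Exception):
    """Hypothetical stand-in for the lexer error discussed in this thread."""


def next_token_one_char(text, pos=0):
    """Return (token_name, end_pos) for the single token starting at pos.

    Assumes pos < len(text). After '(' the choice between LPAREN and
    LP_SELECT is made from one character only, as in the single-token
    MTOKENS rule.
    """
    ch = text[pos]
    if "a" <= ch <= "z":
        end = pos
        while end < len(text) and "a" <= text[end] <= "z":
            end += 1
        return "ID", end
    if ch == "(":
        # Single-token view: '(' followed by 's' can only be LP_SELECT.
        if text[pos + 1:pos + 2] == "s":
            literal = "(select"
            for offset, expected in enumerate(literal):
                actual = text[pos + offset:pos + offset + 1]
                if actual != expected:
                    # Already committed to LP_SELECT, so there is no fallback.
                    raise NoViableAlt(
                        f"expected {expected!r} at index {pos + offset}, "
                        f"got {actual!r}"
                    )
            return "LP_SELECT", pos + len(literal)
        return "LPAREN", pos + 1
    raise NoViableAlt(f"unexpected {ch!r} at index {pos}")
```

With this model, "(select" comes back as LP_SELECT, while "(security"
raises at the 'c', which matches the failure Geoffrey reported.
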
However, don't you really want MTOKENS to be:

MTOKENS
	:	(ID | LPAREN | LP_SELECT)+
	;

A lexer does return multiple tokens. Using this rule, ANTLR correctly
checks for the entire 'select' string before deciding to go with
LP_SELECT.

This seems like a bug in ANTLR to me.
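
For comparison, a rough Python sketch of the behaviour I would expect
from the (...)+ form. Again this is only a toy model and not what ANTLR
generates: tokenize is a hypothetical helper, and startswith stands in
for whatever lookahead the real lexer would use to see the whole
'select' literal before choosing LP_SELECT.

```python
def tokenize(text):
    """Split text into (token_name, lexeme) pairs.

    LP_SELECT is chosen only when the whole "(select" literal lies ahead;
    otherwise '(' is returned as LPAREN and lexing continues after it.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if text.startswith("(select", pos):
            tokens.append(("LP_SELECT", "(select"))
            pos += len("(select")
        elif ch == "(":
            tokens.append(("LPAREN", "("))
            pos += 1
        elif "a" <= ch <= "z":
            end = pos
            while end < len(text) and "a" <= text[end] <= "z":
                end += 1
            tokens.append(("ID", text[pos:end]))
            pos = end
        else:
            # The toy grammar has no rule for anything else (e.g. whitespace).
            raise ValueError(f"unexpected {ch!r} at index {pos}")
    return tokens
```

Here "(security" (no space, to stay within the three rules above) comes
out as LPAREN followed by ID rather than an error.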

Tom.