[antlr-interest] Antlr 3 Lexer problem
tbrandonau at gmail.com
Wed Jun 27 01:05:13 PDT 2007
On 6/27/07, Gavin Lambert <antlr at mirality.co.nz> wrote:
> At 09:06 27/06/2007, Geoffrey Zhu wrote:
> >The syntactic predicate does not seem to work. The lexer chokes
> >exactly the same location 'c' if I pass in "( security".
> >In mTokens() it still looks ahead only one step to determine
> >what should be the next token.
> I think this is another occurrence of the case that Ter claims is
> by design, but myself and a few others would like to be different:
> the lexer doesn't do backtracking, it simply fails with
> NoViableAltExceptions (or the equivalent) -- even when the parent
> grammar does do backtracking. Basically once it enters a
> particular token it's going to either match that token or cause an
> error; it won't go back and pick a different token.
I think Ter's argument was that the LL(*) algorithm used in the lexer
is more powerful than backtracking. However this seems to be a case
where the LL(*) algorithm falls over.
If you look at the generated code you can see there is an mTokens rule
with a comment "// T.g:1:10: ( ID | LPAREN | LP_SELECT )". So ANTLR is
effectively generating a lexer for the grammar:
: ID | LPAREN | LP_SELECT
ID : ('a'..'z')+;
LPAREN : '(';
LP_SELECT : LPAREN 'select';
For this grammar, ANTLR generates a correct lexer. MTOKENS can only
return one of ID, LPAREN and LP_SELECT, so once it has seen the '('
ANTLR only needs to look at the 's' to decide which rule to follow.
Given the 's', MTOKENS must match LP_SELECT or give an error, as
matching LPAREN followed by ID is not an option.
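To make the above concrete, here is a rough hand-written sketch in Python of that decision. It is not the code ANTLR generates; the function names are made up for illustration, and I am assuming the input has no whitespace between '(' and the following word.

```python
def predict_single_token(text):
    """Choose one alternative of ID | LPAREN | LP_SELECT, assuming the
    rule may return exactly one token (hypothetical helper)."""
    if not text:
        raise ValueError("no viable alternative at end of input")
    first = text[0]
    if 'a' <= first <= 'z':
        return "ID"
    if first == '(':
        # Only one token can be returned, so '(' followed by 's' can only
        # be the start of LP_SELECT. One extra character decides.
        if len(text) > 1 and text[1] == 's':
            return "LP_SELECT"
        return "LPAREN"
    raise ValueError(f"no viable alternative at {first!r}")


def match_single_token(text):
    """Commit to the predicted alternative and match it, with no way back."""
    alt = predict_single_token(text)
    if alt == "LP_SELECT":
        expected = "(select"
        for i, ch in enumerate(expected):
            if i >= len(text) or text[i] != ch:
                raise ValueError(f"mismatched character at index {i}")
        return ("LP_SELECT", expected)
    if alt == "LPAREN":
        return ("LPAREN", "(")
    # ID: consume the run of lowercase letters
    end = 0
    while end < len(text) and 'a' <= text[end] <= 'z':
        end += 1
    return ("ID", text[:end])
```

With "(security" the prediction is LP_SELECT, and the match then fails at index 3, which is the 'c'.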
However, don't you really want MTOKENS to be:
: (ID | LPAREN | LP_SELECT)+
A lexer does return multiple tokens. Using this rule, ANTLR correctly
checks for the entire 'select' string before deciding to go with
LP_SELECT.
This seems like a bug in ANTLR to me.
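For comparison, a rough Python sketch of the behaviour I would expect from a lexer that returns a sequence of tokens. Again this is hand-written for illustration, not ANTLR output; the `tokenize` name is made up, and whitespace handling is left out.

```python
def tokenize(text):
    """Split text into ID, LPAREN and LP_SELECT tokens, checking for the
    whole '(select' before choosing LP_SELECT (hypothetical helper)."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == '(':
            if text.startswith("(select", pos):
                # The full keyword is present, so LP_SELECT is safe.
                tokens.append(("LP_SELECT", "(select"))
                pos += len("(select")
            else:
                # Otherwise fall back to a plain LPAREN and carry on.
                tokens.append(("LPAREN", "("))
                pos += 1
        elif 'a' <= ch <= 'z':
            end = pos
            while end < len(text) and 'a' <= text[end] <= 'z':
                end += 1
            tokens.append(("ID", text[pos:end]))
            pos = end
        else:
            raise ValueError(f"no viable alternative at {ch!r}")
    return tokens
```

Here "(security" comes out as LPAREN followed by ID, while "(select" comes out as a single LP_SELECT.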