[antlr-interest] Antlr 3 Lexer problem

Wed Jun 27 06:35:23 PDT 2007

I see. That makes sense. So what do you think of doing the following?
The DFA that ANTLR generates is difficult to decipher in this case, but
according to your theory it should work, shouldn't it? And if I let
LP_OTHER emit two tokens, it should be fine?

ID : ('a'..'z')+;

fragment WS : ' ' | '\t' ;

LPAREN    : '(' ;

LP_SELECT : '(' WS* 'select';

LP_OTHER  : '(' WS* ID ;

prog: (ID|LP_OTHER|LP_SELECT|WS)+ ;
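To make the "emit two tokens" idea concrete, here is a rough Python sketch of the behaviour I am after. It is a hand-written stand-in, not ANTLR-generated code: the tokenize function and the regular expressions are hypothetical names for illustration only, and the sketch assumes that whitespace between tokens is simply discarded.

```python
import re

# Hypothetical stand-ins for the lexer rules above (not ANTLR output).
LP_RULE = re.compile(r"\(([ \t]*)([a-z]+)")  # '(' WS* followed by letters
ID_RULE = re.compile(r"[a-z]+")
WS_RULE = re.compile(r"[ \t]+")


def tokenize(text):
    """Return a list of (type, text) pairs for the input string."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = LP_RULE.match(text, pos)
        if m:
            if m.group(2) == "select":
                # '(' WS* 'select' stays together as a single token.
                tokens.append(("LP_SELECT", m.group(0)))
            else:
                # The LP_OTHER case: emit two tokens instead of one.
                tokens.append(("LPAREN", "("))
                tokens.append(("ID", m.group(2)))
            pos = m.end()
            continue
        if text[pos] == "(":
            tokens.append(("LPAREN", "("))
            pos += 1
            continue
        m = ID_RULE.match(text, pos)
        if m:
            tokens.append(("ID", m.group(0)))
            pos = m.end()
            continue
        m = WS_RULE.match(text, pos)
        if m:
            pos = m.end()  # assumption: whitespace is discarded
            continue
        raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens
```

With this sketch, tokenize("( security") yields an LPAREN token followed by an ID token, while "( select" still comes out as a single LP_SELECT token.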

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Thomas Brandon
Sent: Wednesday, June 27, 2007 3:05 AM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Antlr 3 Lexer problem

On 6/27/07, Gavin Lambert <antlr at mirality.co.nz> wrote:
> At 09:06 27/06/2007, Geoffrey Zhu wrote:
>  >The syntactic predicate does not seem to work. The lexer chokes on
>  >exactly the same location 'c' if I pass in "( security".
>  >
>  >In mTokens() it still looks ahead only one step to determine what
>  >should be the next token.
>
> I think this is another occurrence of the case that Ter claims is by
> design, but that a few others and I would like to see changed:
> the lexer doesn't do backtracking, it simply fails with a
> NoViableAltException (or the equivalent) -- even when the parent
> grammar does do backtracking.  Basically, once it enters a particular
> token it's going to either match that token or cause an error; it
> won't go back and pick a different token.
>
I think Ter's argument was that the LL(*) algorithm used in the lexer is
more powerful than backtracking. However this seems to be a case where
the LL(*) algorithm falls over.
If you look at the generated code you can see there is an mTokens rule
with a comment "// T.g:1:10: ( ID | LPAREN | LP_SELECT )". So ANTLR is
effectively generating a lexer for the grammar:

MTOKENS
	:	ID | LPAREN | LP_SELECT
	;

fragment
ID : ('a'..'z')+;

fragment
LPAREN : '(';

fragment
LP_SELECT : LPAREN 'select';

For this grammar, ANTLR generates a correct lexer. MTOKENS can only
return one of ID, LPAREN and LP_SELECT, so once it has seen the '('
ANTLR only needs to look at the 's' to decide which rule to follow.
Given the 's', MTOKENS must match LP_SELECT or give an error, as
matching LPAREN and then a further token is not an option.
However, don't you really want MTOKENS to be:
MTOKENS
	:	(ID | LPAREN | LP_SELECT)+
	;
A lexer does return multiple tokens. Using this rule, ANTLR correctly
checks for the entire 'select' string before deciding to go with
LP_SELECT.
This seems like a bug in ANTLR to me.
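
To illustrate the difference, here is a rough Python simulation of the two prediction strategies for ID | LPAREN | LP_SELECT. It is only a sketch of the behaviour described above, not the code ANTLR generates; the lex function and its full_lookahead flag are hypothetical names.

```python
def lex(text, full_lookahead):
    """Tokenize text into (type, text) pairs for ID | LPAREN | LP_SELECT.

    full_lookahead=False predicts LP_SELECT from the single character
    after '('; full_lookahead=True checks the whole word 'select' first.
    """
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch == "(":
            if full_lookahead:
                predict_select = text.startswith("select", pos + 1)
            else:
                predict_select = text.startswith("s", pos + 1)
            if predict_select:
                # Committed to LP_SELECT: match 'select' or fail, no fallback.
                for offset, expected in enumerate("select", start=1):
                    actual = text[pos + offset:pos + offset + 1]
                    if actual != expected:
                        shown = actual or "EOF"
                        raise ValueError(f"no viable alternative at {shown!r}")
                tokens.append(("LP_SELECT", text[pos:pos + 7]))
                pos += 7
            else:
                tokens.append(("LPAREN", "("))
                pos += 1
        elif "a" <= ch <= "z":
            # ID : ('a'..'z')+
            end = pos
            while end < len(text) and "a" <= text[end] <= "z":
                end += 1
            tokens.append(("ID", text[pos:end]))
            pos = end
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens
```

In this sketch, lex("(security", full_lookahead=False) raises an error at the 'c', the same location reported earlier in the thread, whereas lex("(security", full_lookahead=True) returns an LPAREN token followed by an ID token.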

Tom.
