[antlr-interest] Antlr 3 Lexer problem

Thomas Brandon tbrandonau at gmail.com
Wed Jun 27 07:34:11 PDT 2007


On 6/27/07, Geoffrey Zhu <gzhu at peak6.com> wrote:
> I see. This makes sense. So what do you think of doing the following?
> The DFA ANTLR generates is difficult to decipher in this case, but it
> should work according to your theory? If I let LP_OTHER emit two tokens,
> it should be fine?
>
> ID : ('a'..'z')+;
>
>
> fragment WS : ' ' | '\t' ;
>
> LPAREN    : '(' ;
>
> LP_SELECT : '(' WS* 'select';
>
> LP_OTHER:       '(' WS* ID;
>
>
> prog: (ID|LP_OTHER|LP_SELECT|WS)+ ;
>
>
By default ANTLR does not allow a rule to emit multiple tokens. It can
be modified to do so, but it is probably easier if you can avoid that.
Something along the lines of Jim's suggestion to use a predicate seems
better, but as Gavin noted and you found, it can be tricky to get
ANTLR to actually use your predicate when it thinks the decision is
not syntactically ambiguous. Maybe something like:
LPAREN: '(' ( ('select')=> 'select' {$type = LP_SELECT;} )?;
would work, though again ANTLR may choose to ignore the predicate.
A semantic predicate or gated semantic predicate might work instead,
though it would be somewhat annoying to write. Something like:
LP_SELECT: '(' { input.LT(1) == 's' && input.LT(2) == 'e' ... &&
input.LT(6) == 't' }? 'select';
Or you could wrap that up in a matchLT(input, "select") function. You
may also need to put this sort of predicate into your LPAREN rule, as in
my first suggestion.
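
A minimal sketch of what such a helper could look like, written in plain
Java with no ANTLR dependency so it can be run on its own. Everything here
is an assumption for illustration: the class name, the extra position
argument, and the classifyParen method are hypothetical, and a real version
would read from the lexer's input stream rather than from a CharSequence.
The point is only that the whole 'select' string is checked before
committing to LP_SELECT:

```java
final class LookaheadSketch {
    /** Returns true if text appears in input starting at index pos. */
    public static boolean matchLT(CharSequence input, int pos, String text) {
        if (pos < 0 || pos + text.length() > input.length()) {
            return false; // not enough characters left to match
        }
        for (int i = 0; i < text.length(); i++) {
            if (input.charAt(pos + i) != text.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Decides which token starts at pos, assuming input.charAt(pos) is '('.
     * Skips spaces and tabs after the '(' (the WS* in the grammar), then
     * looks for the complete word "select" before choosing LP_SELECT.
     */
    public static String classifyParen(CharSequence input, int pos) {
        int p = pos + 1;
        while (p < input.length()
                && (input.charAt(p) == ' ' || input.charAt(p) == '\t')) {
            p++;
        }
        return matchLT(input, p, "select") ? "LP_SELECT" : "LPAREN";
    }

    public static void main(String[] args) {
        System.out.println(classifyParen("( select", 0));
        System.out.println(classifyParen("( security", 0));
    }
}
```

With the input "( security" this falls back to LPAREN instead of failing at
the 'c'. Note that it only tests for the prefix 'select', as the grammar
literal does, so "( selection" would still be classified as LP_SELECT.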

Tom.
>
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Thomas Brandon
> Sent: Wednesday, June 27, 2007 3:05 AM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Antlr 3 Lexer problem
>
> On 6/27/07, Gavin Lambert <antlr at mirality.co.nz> wrote:
> > At 09:06 27/06/2007, Geoffrey Zhu wrote:
> >  >The syntactic predicate does not seem to work. The lexer chokes on
> > >exactly the same location 'c' if I pass in "( security".
> >  >
> >  >In mTokens() it still looks ahead only one step to determine what
> > >should be the next token.
> >
> > I think this is another occurrence of the case that Ter claims is by
> > design, but that a few others and I would like to be different:
> > the lexer doesn't do backtracking, it simply fails with
> > NoViableAltExceptions (or the equivalent) -- even when the parent
> > grammar does do backtracking.  Basically once it enters a particular
> > token it's going to either match that token or cause an error; it
> > won't go back and pick a different token.
> >
> I think Ter's argument was that the LL(*) algorithm used in the lexer is
> more powerful than backtracking. However this seems to be a case where
> the LL(*) algorithm falls over.
> If you look at the generated code you can see there is an mTokens rule
> with a comment "// T.g:1:10: ( ID | LPAREN | LP_SELECT )". So ANTLR is
> effectively generating a lexer for the grammar:
>
> MTOKENS
>         :       ID | LPAREN | LP_SELECT
>         ;
>
> fragment
> ID : ('a'..'z')+;
>
> fragment
> LPAREN : '(';
>
> fragment
> LP_SELECT : LPAREN 'select';
>
> For this grammar, ANTLR generates a correct lexer. MTOKENS can only
> return one of ID, LPAREN and LP_SELECT, so once it has seen the '('
> ANTLR only needs to look at the 's' to decide which rule to follow:
> given the 's', MTOKENS must match LP_SELECT or give an error, as matching
> LPAREN LP_SELECT is not an option.
> However, don't you really want MTOKENS to be:
> MTOKENS
>         :       (ID | LPAREN | LP_SELECT)+
>         ;
> A lexer does return multiple tokens. Using this rule, ANTLR correctly
> checks for the entire 'select' string before deciding to go with
> LP_SELECT.
> This seems like a bug in ANTLR to me.
>
> Tom.

