[antlr-interest] Can antlr v3 lex star | tristar properly?

Kay Röpke kroepke at classdump.org
Wed Nov 21 06:21:58 PST 2007


Hi!

On Nov 21, 2007, at 2:00 PM, Steve Bennett wrote:

> On 11/21/07, Guntis Ozols <guntiso at latnet.lv> wrote:
>> Is there a way to lex this simple grammar (I am using ANTLRWorks  
>> 1.1.4):
>
> Now, I know you said "lex" but just in case, this works:

No, it doesn't. It will always match 'star', and never 'tristar'  
unless you use a predicate like this:

stars: ((tristar)=>tristar | star)*;

tristar: '*''*''*';
star: '*' {System.out.println("star");};

Watch what happens in ANTLRWorks!

> ----
> stars: (tristar | star )*;
>
> tristar: '*''*''*';
> star: '*';
> ----
>
> Can someone explain to me why this is so hard using just lexing rules?
> What goes wrong?


The problem is basically that ANTLR doesn't do longest-match matching.  
It predicts the next rule that can possibly match based on a minimal  
number of lookahead symbols (characters, tokens or tree nodes).

After seeing two STAR tokens as lookahead, it concludes that the only  
thing that makes sense should be TRISTAR. This behavior is probably  
not terribly intuitive, but as ANTLR doesn't backtrack like lex does  
(lex can simply backtrack in the internal state machine, ANTLR would  
have to do that across method calls...) it's pretty much unavoidable.
In these cases you need to have some kind of predicate to help ANTLR.  
This should only apply to prefix problems like this, though.

Here's my solution to the problem:

stars	: (STAR | TRISTAR)* EOF;

TRISTAR	: {input.LA(3) == '*'}? => '*' '*' '*';
STAR	: '*';

Works like a charm. Try it with five '*' chars in ANTLRWorks :)
You only have to help out at one place here, to force it to match the  
longer token first. Pretty good tradeoff if you ask me.

cheers,
-k
-- 
Kay Röpke
http://classdump.org/








More information about the antlr-interest mailing list