[antlr-interest] Can antlr v3 lex star | tristar properly?
Kay Röpke
kroepke at classdump.org
Wed Nov 21 06:21:58 PST 2007
Hi!
On Nov 21, 2007, at 2:00 PM, Steve Bennett wrote:
> On 11/21/07, Guntis Ozols <guntiso at latnet.lv> wrote:
>> Is there a way to lex this simple grammar (I am using ANTLRWorks
>> 1.1.4):
>
> Now, I know you said "lex" but just in case, this works:
No, it doesn't. It will always match 'star', and never 'tristar'
unless you use a predicate like this:
stars: ((tristar)=>tristar | star)*;
tristar: '*''*''*';
star: '*' {System.out.println("star");};
Watch what happens in ANTLRWorks!
> ----
> stars: (tristar | star )*;
>
> tristar: '*''*''*';
> star: '*';
> ----
>
> Can someone explain to me why this is so hard using just lexing rules?
> What goes wrong?
The problem is basically that ANTLR doesn't do longest-match matching.
It predicts the next rule that can possibly match based on a minimal
number of lookahead symbols (characters, tokens or tree nodes).
After seeing two STAR tokens as lookahead, it concludes that the only
thing that makes sense should be TRISTAR. This behavior is probably
not terribly intuitive, but as ANTLR doesn't backtrack like lex does
(lex can simply backtrack in the internal state machine, ANTLR would
have to do that across method calls...) it's pretty much unavoidable.
In these cases you need to have some kind of predicate to help ANTLR.
This should only apply to prefix problems like this, though.
Here's my solution to the problem:
stars : (STAR | TRISTAR)* EOF;
TRISTAR : {input.LA(3) == '*'}? => '*' '*' '*';
STAR : '*';
Works like a charm. Try it with five '*' chars in ANTLRWorks :)
You only have to help out at one place here, to force it to match the
longer token first. Pretty good tradeoff if you ask me.
cheers,
-k
--
Kay Röpke
http://classdump.org/
More information about the antlr-interest
mailing list