[antlr-interest] Can antlr v3 lex star | tristar properly?

Guntis Ozols guntiso at latnet.lv
Wed Nov 21 07:14:04 PST 2007


Is it a bug or a feature that
  TRISTAR : ('***')=>'***'; does not work?

Is it a bug or a feature that
  STAR : '*' ('**' {type = TRISTAR;})?; does not work?

Can it be lexed with only syntactic predicates?

How can the following be lexed:
  DCOLON       : '::';
  NS_TEST      : NCName ':*';
  PrefixedName : NCName ':' NCName;
  NCName       : ('a'..'z' | 'A'..'Z' | '_')
                 ('a'..'z' | 'A'..'Z' | '.' | '-' | '_' | '0'..'9')*;
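
One way to think about these four rules is to match the name first and then peek past it before choosing a token type, in the spirit of the predicate approach quoted below. The following is only a hand-written Python sketch of that idea, not ANTLR output and not a tested grammar. The function name lex_names and the (type, text) tuple format are made up for illustration.

```python
import re

# NCName as defined in the grammar above: a letter or '_' followed by
# letters, digits, '.', '-' or '_'.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def lex_names(text):
    """Split text into (token_type, lexeme) pairs (hypothetical helper)."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text.startswith("::", pos):
            tokens.append(("DCOLON", "::"))
            pos += 2
            continue
        name = NAME.match(text, pos)
        if name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        end = name.end()
        # Peek past the name before deciding which token it belongs to.
        if text.startswith(":*", end):
            tokens.append(("NS_TEST", text[pos:end + 2]))
            pos = end + 2
            continue
        if text.startswith(":", end):
            local = NAME.match(text, end + 1)
            if local is not None:
                tokens.append(("PrefixedName", text[pos:local.end()]))
                pos = local.end()
                continue
        # A plain name; a following '::' is picked up on the next iteration.
        tokens.append(("NCName", name.group()))
        pos = end
    return tokens
```

With this sketch, lex_names("child::ns:*") gives NCName "child", DCOLON "::" and NS_TEST "ns:*", while lex_names("a:b") gives a single PrefixedName. In a grammar, the same checks would presumably be expressed as predicates or actions on the NCName rule.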

> The problem is basically that ANTLR doesn't do longest-match matching.
> It predicts the next rule that can possibly match based on a minimal
> number of lookahead symbols (characters, tokens or tree nodes).
>
> After seeing two '*' characters as lookahead, it concludes that the only
> thing that makes sense is TRISTAR. This behavior is probably
> not terribly intuitive, but as ANTLR doesn't backtrack like lex does
> (lex can simply backtrack in the internal state machine, ANTLR would
> have to do that across method calls...) it's pretty much unavoidable.
> In these cases you need to have some kind of predicate to help ANTLR.
> This should only apply to prefix problems like this, though.
>
> Here's my solution to the problem:
>
> stars	: (STAR | TRISTAR)* EOF;
>
> TRISTAR	: {input.LA(3) == '*'}? => '*' '*' '*';
> STAR	: '*';
>
> Works like a charm. Try it with five '*' chars in ANTLRWorks :)
> You only have to help out at one place here, to force it to match the
> longer token first. Pretty good tradeoff if you ask me.
>
> cheers,
> -k
> --
> Kay Röpke
> http://classdump.org/
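
To make the quoted explanation concrete, here is a small hand-written Python sketch. It is a simplified model of the behaviour described above, not what ANTLR actually generates. lex_stars_committing_early commits to TRISTAR after seeing two '*' characters, and lex_stars checks the third character first, much like the {input.LA(3) == '*'}? predicate. Both function names are invented for this example.

```python
STAR = "STAR"
TRISTAR = "TRISTAR"


def lex_stars_committing_early(text):
    """Model of the unguarded lexer: two '*' of lookahead predict TRISTAR."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] != "*":
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if text.startswith("**", pos):
            # Committed to TRISTAR; there is no way back to STAR from here.
            if not text.startswith("***", pos):
                raise ValueError(f"mismatched input at offset {pos}: expected '***'")
            tokens.append(TRISTAR)
            pos += 3
        else:
            tokens.append(STAR)
            pos += 1
    return tokens


def lex_stars(text):
    """Model of the guarded lexer: look at the third character before choosing."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] != "*":
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if text.startswith("***", pos):
            # Plays the role of the gated predicate on TRISTAR.
            tokens.append(TRISTAR)
            pos += 3
        else:
            tokens.append(STAR)
            pos += 1
    return tokens
```

On five '*' characters, lex_stars returns TRISTAR, STAR, STAR, while lex_stars_committing_early raises a ValueError at the fourth character because it has already decided on a second TRISTAR.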



