[antlr-interest] Can antlr v3 lex star | tristar properly?

Wed Nov 21 10:20:44 PST 2007

You can do this without predicates though. People seem to reach for the predicate stick immediately the lexer doesn't do what they want. Look at the common routes through the tokens and make a single lex rule with alts at each of those points. The lexer code will then be very natural.

If you were coding if statements in a language, you would not code:

If (*(a+2) == '*' && *(a+1) == '*' && *a == '*') {sdkdslkf} else if (*a == '*' { ****

You code:

if (*a == '*')
{
  If (*(a+1) == '*' && *(a+2) == '*')
  {
    L;
  }
  else
  {
    X;
...

So, if you think of constructing the lexer rules as guiding it through the characters it is coming across, you will arrive at:

T : '*'
     (
        // Here can be tristar or star
        //
          '**'  { $type = T3; } // Tristar
        | // Anything else means this is just a T
     )
  ;

And will require no predicates.

A predicate might help you here if you only want to take the '***' path if there really are three '*', instead of erroring out in the lexer if there are two. However, Personally I prefer to catch things like that and issue a specific error message (which programmers like).

You can say:

T : '*'
     (
        // Here can be tristar or star or perhaps error
        //
          '*'
		(
                 '*'  { $type = T3; } // Tristar
               | { // call "Error: malformed operator '**' did you mean '*' or '***' }
             )
        | // Anything else means this is just a T
     )

That kind of technique is useful for trapping un-terminated quoted strings while you are in the lexer and so on.

Jim

> -----Original Message-----
> From: Kay Röpke [mailto:kroepke at classdump.org]
> Sent: Wednesday, November 21, 2007 6:22 AM
> To: Steve Bennett
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Can antlr v3 lex star | tristar properly?
> 
> Hi!
> 
> On Nov 21, 2007, at 2:00 PM, Steve Bennett wrote:
> 
> > On 11/21/07, Guntis Ozols <guntiso at latnet.lv> wrote:
> >> Is there a way to lex this simple grammar (I am using ANTLRWorks
> >> 1.1.4):
> >
> > Now, I know you said "lex" but just in case, this works:
> 
> No, it doesn't. It will always match 'star', and never 'tristar'
> unless you use a predicate like this:
> 
> stars: ((tristar)=>tristar | star)*;
> 
> tristar: '*''*''*';
> star: '*' {System.out.println("star");};
> 
> Watch what happens in ANTLRWorks!
> 
> > ----
> > stars: (tristar | star )*;
> >
> > tristar: '*''*''*';
> > star: '*';
> > ----
> >
> > Can someone explain to me why this is so hard using just lexing rules?
> > What goes wrong?
> 
> 
> The problem is basically that ANTLR doesn't do longest-match matching.
> It predicts the next rule that can possibly match based on a minimal
> number of lookahead symbols (characters, tokens or tree nodes).
> 
> After seeing two STAR tokens as lookahead, it concludes that the only
> thing that makes sense should be TRISTAR. This behavior is probably
> not terribly intuitive, but as ANTLR doesn't backtrack like lex does
> (lex can simply backtrack in the internal state machine, ANTLR would
> have to do that across method calls...) it's pretty much unavoidable.
> In these cases you need to have some kind of predicate to help ANTLR.
> This should only apply to prefix problems like this, though.
> 
> Here's my solution to the problem:
> 
> stars	: (STAR | TRISTAR)* EOF;
> 
> TRISTAR	: {input.LA(3) == '*'}? => '*' '*' '*';
> STAR	: '*';
> 
> Works like a charm. Try it with five '*' chars in ANTLRWorks :)
> You only have to help out at one place here, to force it to match the
> longer token first. Pretty good tradeoff if you ask me.
> 
> cheers,
> -k
> --
> Kay Röpke
> http://classdump.org/
> 
> 
> 
> 
> 
>