[antlr-interest] More on Lexer 2-char seq handling

Mon Oct 12 17:33:18 PDT 2009

Hi folks:

Further to the discussion on lexer rules that match a sequence which should stop before some multi-character pattern:

I read Kirby's post with interest, including the list discussions it pointed to.  I'm not sure what to make of it.  The oddity to me is that ANTLR *almost* generates the right thing:

1. mTokens does the right thing.

2. The lexer rule code that matches/consumes the string in question does look ahead, and does see the error it would make if it consumed the pattern it is supposed to stop before.

3. ANTLR just doesn't generate the code to look ahead and *predict* that it should *stop*; it only looks ahead far enough to predict which alternative *might* succeed based on the first character.
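
To make point 3 concrete, here is a small hand-written Java sketch of a loop that picks an alternative from the first character alone. This is NOT the code ANTLR generates; the class and method names are made up for illustration, and it only mimics the shape of the PURETEXT rule discussed in this thread.

```java
// Hypothetical hand-written scanner, not ANTLR output.
// Models ( '[' ~'[' | ']' ~']' | ~('\\'|'['|']'|'|'|'\n') )+ with a
// decision based on the first character only. The one-or-more check of
// the ()+ block is omitted to keep the sketch short.
class OneCharPrediction {
    /** Returns the index just past the matched text, or throws on a mismatch. */
    static int scan(String s, int start) {
        int p = start;
        while (p < s.length()) {
            char c = s.charAt(p);
            if (c == '[' || c == ']') {
                // The alternative "c ~c" was chosen by looking at c alone,
                // so a doubled bracket is only noticed while matching.
                if (p + 1 >= s.length() || s.charAt(p + 1) == c) {
                    throw new IllegalStateException("mismatch at index " + (p + 1));
                }
                p += 2;
            } else if (c == '\\' || c == '|' || c == '\n') {
                break;   // no alternative starts with this character: exit the loop
            } else {
                p++;     // any other single character
            }
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(scan("foo|bar", 0));
        try {
            scan("foo]]", 0);
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

On "foo]]" the loop commits to the ']' alternative as soon as it sees the first ']', and only then finds that the next character is also ']'. That is the failure mode I mean: the error is detected, but too late to simply stop.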

Making matters odder still, you can trick ANTLR into generating the correct lookahead, though the resulting code is not entirely desirable, as shown below:

In his last post, Gavin recommended the following fix for Martin Potier's PURETEXT token rule:
-------------------------
PURETEXT: 
  ( 
      '[' ~'['
    | ']' ~']'
    | ~('\\' | '[' | ']' | '|' | '\n' )
  )+ 
 ;
-------------------------
(I removed mentions of tokens that Martin didn't give a definition for).

However, this fails in the manner described above.  Instead, the grammar below contains a solution of sorts.

-------------------------
grammar Potier;

links: link+
 ;

link:
  LO PURETEXT ('|' PURETEXT)? LE 
    {System.out.println("Link: " + $link.text); }
  ;

LO	: '[[';		// Link opening
LE	: ']]';		// Link ending

PURETEXT: 
  ( 
      '[' ~'['
    | ']' ~']'
    | ~('\\' | '[' | ']' | '|' | '\n' )
  )+ 
  ']]'   // And delete the match("]]") from gen code
 ;
-------------------------

The only thing I've added is the additional requirement that PURETEXT end with ']]'. This prompts ANTLR to generate LA(2) lookahead prediction code for the ()+ block, and break out on seeing ']]' coming up.  

Now of course we don't want to include ']]' in PURETEXT, and this can be fixed by deleting the match("]]"); call from the generated rule code.

That results in the desired behavior, but it is obviously a silly thing to have to do.
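
For comparison, the behavior I'm after (break out of the loop when ']]' is coming up, without consuming it) can be sketched by hand in Java as follows. Again, this is only an illustration under my own assumptions, not ANTLR's generated code; the class name, method name and EOF constant are hypothetical.

```java
// Hypothetical hand-written scanner, not ANTLR output.
// Same PURETEXT alternatives as the grammar, but the loop decision uses
// two characters of lookahead and exits when "]]" is next.
// The one-or-more check of the ()+ block is omitted for brevity.
class TwoCharLookahead {
    static final int EOF = -1;

    /** Returns the index just past the matched text; a following "]]" is left unconsumed. */
    static int scanPureText(String s, int start) {
        int p = start;
        while (p < s.length()) {
            int la1 = s.charAt(p);
            int la2 = (p + 1 < s.length()) ? s.charAt(p + 1) : EOF;
            if (la1 == ']' && la2 == ']') {
                break;   // predicted exit: "]]" belongs to the LE token
            }
            if (la1 == '[' || la1 == ']') {
                // Alternative "c ~c": the second character must exist and differ.
                if (la2 == EOF || la2 == la1) {
                    throw new IllegalStateException("mismatch at index " + (p + 1));
                }
                p += 2;
            } else if (la1 == '\\' || la1 == '|' || la1 == '\n') {
                break;   // no alternative starts with this character
            } else {
                p++;     // any other single character
            }
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(scanPureText("foo]]", 0));
        System.out.println(scanPureText("a]b]]", 0));
    }
}
```

The only difference from a first-character decision is the extra test on the second character before committing to the ']' alternative, which is why it seems odd that the generator doesn't produce it without the ']]' trick.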

I have to admit that this seems like a problem with the algorithm that ANTLR uses to determine LA(>1) lookahead when generating the lexer code.

-- Graham