[antlr-interest] More on Lexer 2-char seq handling

Mon Oct 12 19:03:54 PDT 2009

For completeness, here is an addition to my previous message: a way to get ANTLR to generate a working rule method with the desired lookahead detection for "stop before this 2-char sequence", without hand-editing the generated code.

Sticking with a variant of Martin Potier's grammar:

-------------------------------------
grammar Potier;

file: (link | PURETEXT)+
 ;

link:
  LO PURETEXT ('|' PURETEXT)? LE 
    {System.out.println("Link: " + $link.text); }
  ;

LO	: '[[';		// Link opening
LE	: ']]';		// Link ending

PURETEXT: 
  ( 
      '[' ~'['
    | ']' ~']'
    | ~('\\' | '[' | ']' | '|' | '\n' )
  )+ 
  ( (']]' | '[[') { input.seek( input.index()-2); })?  // Rewind the [[ or ]]
 ;
-------------------------------------

... which seems to properly digest input like:

  [[ghi]] tuv [[j[k]l]] qrs 

Output from the println:
  Link: [[ghi]]
  Link: [[j[k]l]]

In case it's not obvious, the key here is that the PURETEXT lexer rule *includes* the "stop before this" symbols, but then rewinds the stream (i.e., seeks back 2 chars) so that they are left for the next token.  For the example's sake, I made this trailing part optional with ()? in case PURETEXT is used outside of [[ ]].
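
To make the "match, then rewind" idea concrete, here is a small hand-written Java sketch of what the rule is meant to do. It is not the code ANTLR generates: the class RewindSketch and its methods la(), pureText() and tokenize() are hypothetical names invented for this illustration, and the field pos only stands in for input.index().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the lexer; not ANTLR-generated code.
class RewindSketch {
    private final String input;
    private int pos = 0; // plays the role of input.index()

    RewindSketch(String input) { this.input = input; }

    // Character at lookahead distance k (1-based), or -1 past end of input.
    private int la(int k) {
        int i = pos + k - 1;
        return i < input.length() ? input.charAt(i) : -1;
    }

    // Third alternative of PURETEXT: ~('\\' | '[' | ']' | '|' | '\n')
    private static boolean isPlain(int c) {
        return c != -1 && c != '\\' && c != '[' && c != ']' && c != '|' && c != '\n';
    }

    // Mimics PURETEXT. Returns the matched text, or null if nothing matched.
    String pureText() {
        int start = pos;
        while (true) {
            int c1 = la(1), c2 = la(2);
            if (c1 == '[' && c2 != -1 && c2 != '[') pos += 2;      // '[' ~'['
            else if (c1 == ']' && c2 != -1 && c2 != ']') pos += 2; // ']' ~']'
            else if (isPlain(c1)) pos += 1;
            else break;
        }
        if (pos == start) return null;
        // Optional tail: consume the delimiter, then put it back.
        if (input.startsWith("[[", pos) || input.startsWith("]]", pos)) {
            pos += 2; // matched ']]' or '[['
            pos -= 2; // input.seek(input.index() - 2)
        }
        return input.substring(start, pos);
    }

    // Splits text into LO, LE and PURETEXT tokens; anything else is a CHAR.
    static List<String> tokenize(String text) {
        RewindSketch lx = new RewindSketch(text);
        List<String> out = new ArrayList<>();
        while (lx.pos < text.length()) {
            if (text.startsWith("[[", lx.pos)) { out.add("LO"); lx.pos += 2; }
            else if (text.startsWith("]]", lx.pos)) { out.add("LE"); lx.pos += 2; }
            else {
                String t = lx.pureText();
                if (t != null) out.add("PURETEXT(" + t + ")");
                else { out.add("CHAR(" + text.charAt(lx.pos) + ")"); lx.pos++; }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("[[ghi]] tuv [[j[k]l]] qrs"));
    }
}
```

Because the delimiter is consumed and then given back, PURETEXT ends just before it and the following LO or LE token still sees both of its characters.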

Adding the "stop before this" part to PURETEXT causes the appropriate lookahead pattern match ("prediction") to get generated, so it seems very odd that this same code can't result from just:
-------------------
PURETEXT: 
  ( 
      '[' ~'['
    | ']' ~']'
    | ~('\\' | '[' | ']' | '|' | '\n' )
  )+
 ;
-------------------

... which instead generates code that is guaranteed to fail. I.e., it doesn't even produce code that could be construed as an alternate correct interpretation.
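
As an illustration of how such a failure can arise, here is a hand-written Java sketch of a loop that picks an alternative from a single character of lookahead. This is an assumption about the shape of the problem, not a copy of the generated code; the class OneCharPrediction and the method pureTextEnd() are hypothetical names.

```java
// Hypothetical illustration; not ANTLR-generated code.
class OneCharPrediction {
    // Returns the index just past the matched text, or throws if an
    // alternative chosen on one character of lookahead cannot complete.
    static int pureTextEnd(String s, int pos) {
        while (pos < s.length()) {
            char c = s.charAt(pos);
            if (c == '[' || c == ']') {
                // Committed to '[' ~'[' or ']' ~']' on the first char alone,
                // so a doubled bracket (or end of input) is a hard mismatch.
                if (pos + 1 >= s.length() || s.charAt(pos + 1) == c) {
                    throw new IllegalStateException("mismatch at index " + (pos + 1));
                }
                pos += 2;
            } else if (c != '\\' && c != '|' && c != '\n') {
                pos++; // ~('\\' | '[' | ']' | '|' | '\n')
            } else {
                break; // no alternative applies: leave the loop
            }
        }
        return pos;
    }
}
```

With this shape, text such as "j[k]l" is matched in full, but "ghi]]" throws as soon as the loop reaches the first ']', because it has already committed to ']' ~']' instead of exiting before the delimiter.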

-- Graham