[antlr-interest] More on Lexer 2-char seq handling
Graham Wideman
gwlist at grahamwideman.com
Mon Oct 12 17:33:18 PDT 2009
Hi folks:
Further to the discussion on lexer rules that match a sequence which should stop before some multi-character pattern:
I read Kirby's post with interest, including the list discussions it pointed to. I'm not sure what to make of it. The oddity to me is that ANTLR *almost* generates the right things:
1. mTokens does the right thing.
2. The lexer rule code that matches/consumes the string in question does look ahead and see the error it would make if it consumed the end-before-this pattern.
3. ANTLR just doesn't generate the code to look ahead and *predict* that it should *stop*; it only looks ahead far enough to predict which alternative *might* succeed based on the first character.
Odder still, you can trick ANTLR into generating the correct look-ahead, though the resulting code is not entirely desirable, as shown below:
In his last post, Gavin recommended the following fix for Martin Potier's PURETEXT token rule:
-------------------------
PURETEXT:
    (
        '[' ~'['
      | ']' ~']'
      | ~('\\' | '[' | ']' | '|' | '\n' )
    )+
    ;
-------------------------
(I removed mentions of tokens that Martin didn't give a definition for).
However, this fails in the manner described above: the ()+ loop picks the ']' ~']' alternative on seeing a single ']', and only then runs into the second ']'. The grammar below contains a solution of sorts.
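To make the failure concrete, here is a small hand-written Java simulation of a ()+ loop that chooses its alternative from the first character only. It is a sketch of the behavior described in points 2 and 3, not the code ANTLR actually emits; the class name OneCharPrediction and the helpers la, isPlain and matchPureText are made up for illustration.

```java
class OneCharPrediction {
    static final int EOF = -1;

    // LA(k): the k-th character ahead of pos (1-based), or EOF past the end.
    static int la(String s, int pos, int k) {
        int i = pos + k - 1;
        return i < s.length() ? s.charAt(i) : EOF;
    }

    // Third alternative of PURETEXT: ~('\\' | '[' | ']' | '|' | '\n')
    static boolean isPlain(int c) {
        return c != EOF && c != '\\' && c != '[' && c != ']' && c != '|' && c != '\n';
    }

    // Returns the number of characters consumed, or -1 if an alternative
    // was chosen on its first character and then failed on its second.
    static int matchPureText(String s) {
        int pos = 0;
        while (true) {
            int c1 = la(s, pos, 1);
            if (c1 == '[' || c1 == ']') {
                // Committed on LA(1) alone; only now is the second character checked.
                int c2 = la(s, pos, 2);
                if (c2 == EOF || c2 == c1) {
                    return -1; // mismatch: "[[" or "]]" (or end of input)
                }
                pos += 2;
            } else if (isPlain(c1)) {
                pos++;
            } else {
                break; // no alternative of the loop can start here
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(matchPureText("abc"));   // 3
        System.out.println(matchPureText("a]b"));   // 3
        System.out.println(matchPureText("abc]]")); // -1
    }
}
```

The last call is the interesting one: the loop should have stopped after "abc", but it commits to the ']' alternative and then reports a mismatch instead.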
-------------------------
grammar Potier;

links: link+
    ;

link:
    LO PURETEXT ('|' PURETEXT)? LE
    {System.out.println("Link: " + $link.text); }
    ;

LO : '[['; // Link opening
LE : ']]'; // Link ending

PURETEXT:
    (
        '[' ~'['
      | ']' ~']'
      | ~('\\' | '[' | ']' | '|' | '\n' )
    )+
    ']]' // And delete the match("]]") from gen code
    ;
-------------------------
The only change is the added requirement that PURETEXT end with ']]'. This prompts ANTLR to generate LA(2) lookahead prediction code for the ()+ block, which breaks out of the loop on seeing ']]' coming up.
Now of course we don't want to include ']]' in PURETEXT, and this can be fixed by editing the match("]]"); call out of the generated rule.
That results in the desired behavior, but it is obviously a silly thing to have to do.
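For comparison, here is a hand-written Java sketch of the loop with two-character prediction, plus a flag that models whether the trailing match("]]") is kept or edited out. Again this is an illustration of the behavior described, not ANTLR's generated code; TwoCharPrediction, la, isPlain, matchPureText and keepTerminatorMatch are hypothetical names.

```java
class TwoCharPrediction {
    static final int EOF = -1;

    // LA(k): the k-th character ahead of pos (1-based), or EOF past the end.
    static int la(String s, int pos, int k) {
        int i = pos + k - 1;
        return i < s.length() ? s.charAt(i) : EOF;
    }

    // Third alternative of PURETEXT: ~('\\' | '[' | ']' | '|' | '\n')
    static boolean isPlain(int c) {
        return c != EOF && c != '\\' && c != '[' && c != ']' && c != '|' && c != '\n';
    }

    // Returns the index just past the matched text, or -1 on a mismatch.
    // keepTerminatorMatch == true models the rule with ']]' still consumed;
    // false models the rule after match("]]") has been edited out.
    static int matchPureText(String s, boolean keepTerminatorMatch) {
        int pos = 0;
        while (true) {
            int c1 = la(s, pos, 1);
            int c2 = la(s, pos, 2);
            if ((c1 == '[' || c1 == ']') && c2 != EOF && c2 != c1) {
                pos += 2; // '[' ~'['  or  ']' ~']', predicted on both characters
            } else if (isPlain(c1)) {
                pos++;
            } else {
                break; // includes the case where "]]" is coming up
            }
        }
        if (pos == 0) {
            return -1; // ()+ needs at least one iteration
        }
        if (keepTerminatorMatch) {
            if (!s.startsWith("]]", pos)) {
                return -1; // the rule as written requires ']]' here
            }
            pos += 2;
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(matchPureText("abc]]", true));  // 5
        System.out.println(matchPureText("abc]]", false)); // 3
    }
}
```

With the flag set to false the loop still stops in front of ']]', but the token text no longer includes it, which is the effect of deleting the match call by hand.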
I have to admit that this seems like a problem with the algorithm that ANTLR uses to determine LA(>1) lookahead when generating the lexer code.
-- Graham