[antlr-interest] lexer problem

Fri Oct 31 19:40:04 PDT 2008

At 10:07 1/11/2008, Robert Soule wrote:
 >I was hoping someone might be able to help me out. I have the
 >following grammar:
 >
 >grammar Test;
 >start: '[' AB ']' | A;
 >A: '[a]';
 >AB: ('a' | 'b')+;
 >
 >In English, there is a keyword in my language '[a]', and
 >all other statements are of the form: [(a|b)+]. I tried this
 >with two test cases:
 >
 >test [ab] fails unexpectedly (no viable alternative)
 >test [ba] succeeds
 >
 >I believe that the lexer sees a '[' character followed by
 >an 'a' characters, and expects a ']' next, even though
 >'a' or 'b' could also be valid next input characters. Has
 >anyone had any experience with this type of issue?

Yeah, this is a common prefix problem :)  (By which I both mean 
that it's a common problem and that it's a problem with common 
prefixes.)

Essentially what you've got above are the following lexer rules:

T15: '[';
T16: ']';
A: '[a]';
AB: ('a' | 'b')+;

To decide between these top-level alternatives, ANTLR essentially 
builds a least-lookahead disambiguation table.  With only one 
character of lookahead, it can instantly recognise the difference 
between T16, AB, and *either* of T15 and A, but it needs at least 
two characters to tell between T15 and A.  It never checks that 
third character, which is what it'd need to look at to decide 
between a single A vs. a T15 *followed by* an AB.

To deal with this kind of problem, you need to manually force the 
necessary lookahead.  You can do this by combining the rules with 
common prefixes:

fragment A: '[a]';
LSQUARE: '[' ('a]' { $type = A; })? ;

Another way of writing it:

fragment A: '[a]';
LSQUARE
   :  (A) => A { $type = A; }
   |  '['
   ;

(Either way, of course, you'll need to refer to LSQUARE in your 
parser rules after this.)