[antlr-interest] lexer problem
Gavin Lambert
antlr at mirality.co.nz
Fri Oct 31 19:40:04 PDT 2008
At 10:07 1/11/2008, Robert Soule wrote:
>I was hoping someone might be able to help me out. I have the
>following grammar:
>
>grammar Test;
>start: '[' AB ']' | A;
>A: '[a]';
>AB: ('a' | 'b')+;
>
>In English, there is a keyword in my language '[a]', and
>all other statements are of the form: [(a|b)+]. I tried this
>with two test cases:
>
>test [ab] fails unexpectedly (no viable alternative)
>test [ba] succeeds
>
>I believe that the lexer sees a '[' character followed by
>an 'a' characters, and expects a ']' next, even though
>'a' or 'b' could also be valid next input characters. Has
>anyone had any experience with this type of issue?
Yeah, this is a common prefix problem :) (By which I both mean
that it's a common problem and that it's a problem with common
prefixes.)
Essentially what you've got above are the following lexer rules:
T15: '[';
T16: ']';
A: '[a]';
AB: ('a' | 'b')+;
To decide between these top-level alternatives, ANTLR essentially
builds a least-lookahead disambiguation table. With only one
character of lookahead, it can instantly recognise the difference
between T16, AB, and *either* of T15 and A, but it needs at least
two characters to tell between T15 and A. It never checks that
third character, which is what it'd need to look at to decide
between a single A vs. a T15 *followed by* an AB.
To deal with this kind of problem, you need to manually force the
necessary lookahead. You can do this by combining the rules with
common prefixes:
fragment A: '[a]';
LSQUARE: '[' ('a]' { $type = A; })? ;
Another way of writing it:
fragment A: '[a]';
LSQUARE
: (A) => A { $type = A; }
| '['
;
(Either way, of course, you'll need to refer to LSQUARE in your
parser rules after this.)
More information about the antlr-interest
mailing list