[antlr-interest] Greedy Lexer FYI: How to specify a string (or block comment) with multi-character delimiters (<<string>> or /*comment*/)
Austin Hastings
Austin_Hastings at Yahoo.com
Tue Nov 13 01:14:27 PST 2007
Howdy,
I'm writing this only because I just sank two hours chasing my tail, and
I feel like there ought to be SOMETHING to show for it. Maybe this will
sit on the mailing list and help Google save the next guy some time. :(
Boy, I thought this was trivial. I've piled a bunch of stuff into this
grammar to eliminate any chance that the lexer "knows" where it's going
to be in the file. That said, the QUOTED_LITERAL should recognize <<blah
blah blah\>> with escaped closing marks blah blah blah>>
<code>
grammar g;
QUOTED_LITERAL: Open (Slash Close| ~Close)* Close
;
fragment Slash: '\\';
fragment Open: '<<';
fragment Close: '>>';
// Below here is crap to spoof ANTLR about what it "knows" -- which
didn't work
rule1: (QUOTED_LITERAL | ID)+;
ID: 'a'..'z'+;
WS: (' ' | '\t' | '\n' | '\r')+ {$channel=HIDDEN;};
</code>
I get this generated java code:
<code>
public final void mQUOTED_LITERAL() throws RecognitionException {
try {
int _type = QUOTED_LITERAL;
// sources/org/antlr/gunit/g.g:5:17: Open ( Slash Close | ~ Close )* Close
mOpen();
// sources/org/antlr/gunit/g.g:5:22: ( Slash Close | ~ Close )*
loop1: do {
int alt1=3;
int LA1_0 = input.LA(1);
if ( (LA1_0=='>') ) {
int LA1_1 = input.LA(2);
if ( (LA1_1=='>') ) {
(**) int LA1_4 = input.LA(3);
if ( ((LA1_4>='\u0000' && LA1_4<='\uFFFE')) ) {
alt1=2;
}
}
else if ( ((LA1_1>='\u0000' && LA1_1<='=')||(LA1_1>='?' &&
LA1_1<='\uFFFE')) ) {
alt1=2;
}
}
else if ( (LA1_0=='\\') ) {
</code>
Notice the LA1_4 check? The code is working fine (since the default for
alt1=3 means "end of string reached") until the check if LA1_4 is ANY
character, whereupon the thing keeps eating characters.
The problem here is that the lexer is being greedy. After seeing several
places where .* and .+ occur, you might (and I definitely did - and lost
two hours because of it) think that the lexer is non-greedy by default.
Not so. As Terence says in his book,
"Unfortunately, following the usual convention that all subrules are
greedy makes this notation useless. Such greedy subrules would match all
characters until the end of file. ANTLR automatically makes these two
subrules nongreedy. So, you can use ’.*’ instead of manually specifying
the option."
Those TWO SUBRULES are automatically non-greedy. The lexer by default
still wants to gobble up everything in sight. Which means that for my
thing to work, I need to make my ()* loop non-greedy:
<code>
QUOTED_LITERAL: Open (options {greedy=false;}: Slash Close| ~Close)* Close
;
</code>
This gives me much better code:
<code>
mOpen();
// sources/org/antlr/gunit/g.g:5:22: ( options {greedy=false; } : Slash
Close | ~ Close )*
loop1: do {
int alt1=3;
int LA1_0 = input.LA(1);
if ( (LA1_0=='>') ) {
int LA1_1 = input.LA(2);
if ( (LA1_1=='>') ) {
alt1=3;
}
else if ( ((LA1_1>='\u0000' && LA1_1<='=')||(LA1_1>='?' &&
LA1_1<='\uFFFE')) ) {
alt1=2;
}
}
else if ( (LA1_0=='\\') ) {
alt1=1;
}
</code>
This is important for C-style /* comments */, since they use multi-byte
delimiters. Presumably also for Pascal (* comments *) as well.
Here's the same rule, with the Open and Close delimiters set to
double-quote("). Still with greedy=false, and the structure of the code
is similar, although recognizing the delimiter is much simpler:
<code>
mOpen();
// sources/org/antlr/gunit/g.g:5:22: ( options {greedy=false; } : Slash
Close | ~ Close )*
loop1:
do {
int alt1=3;
int LA1_0 = input.LA(1);
if ( (LA1_0=='\"') ) {
alt1=3;
}
else if ( (LA1_0=='\\') ) {
alt1=1;
}
else if ( ((LA1_0>='\u0000' && LA1_0<='!')||(LA1_0>='#' &&
LA1_0<='[')||(LA1_0>=']' && LA1_0<='\uFFFE')) ) {
alt1=2;
}
</code>
And with greedy=false turned off:
<code>
mOpen();
// sources/org/antlr/gunit/g.g:5:22: ( Slash Close | ~ Close )*
loop1:
do {
int alt1=3;
int LA1_0 = input.LA(1);
if ( (LA1_0=='\\') ) {
int LA1_2 = input.LA(2);
if ( (LA1_2=='\"') ) {
int LA1_4 = input.LA(3);
if ( ((LA1_4>='\u0000' && LA1_4<='\uFFFE')) ) {
alt1=1;
}
else {
alt1=2;
}
}
else if ( ((LA1_0>='\u0000' && LA1_0<='!')||(LA1_0>='#' &&
LA1_0<='[')||(LA1_0>=']' && LA1_0<='\uFFFE')) ) {
alt1=2;
}
</code>
In this case (no "greedy=false" specified, single-character delimiters)
the code generated is very different. Notice that the checks are in a
different order, and the default is to be non-greedy (a closing
double-quote doesn't execute ANY of the code shown, carrying alt1=3 down
through). I don't know if this is a special-case to eliminate
complaining newbies (like me) or if some transformation converts this to
.* under the hood or what. But it seems like there is some clever DWIM
code built in to ANTLR for the general case of recognizing strings with
single-character delimiters.
Now you know -- and knowing is half the battle!
=Austin
More information about the antlr-interest
mailing list