[antlr-interest] Greedy Lexer FYI: How to specify a string (or block comment) with multi-character delimiters (<<string>> or /comment/)

Tue Nov 13 01:14:27 PST 2007

Howdy,

I'm writing this only because I just sank two hours chasing my tail, and 
I feel like there ought to be SOMETHING to show for it. Maybe this will 
sit on the mailing list and help Google save the next guy some time. :(

Boy, I thought this was trivial. I've piled a bunch of stuff into this 
grammar to eliminate any chance that the lexer "knows" where it's going 
to be in the file. That said, the QUOTED_LITERAL should recognize <<blah 
blah blah\>> with escaped closing marks blah blah blah>>

<code>
grammar g;

QUOTED_LITERAL: Open (Slash Close| ~Close)* Close
;

fragment Slash: '\\';
fragment Open: '<<';
fragment Close: '>>';

// Below here is crap to spoof ANTLR about what it "knows" -- which 
didn't work
rule1: (QUOTED_LITERAL | ID)+;
ID: 'a'..'z'+;
WS: (' ' | '\t' | '\n' | '\r')+ {$channel=HIDDEN;};

</code>

I get this generated java code:

<code>

public final void mQUOTED_LITERAL() throws RecognitionException {
try {
int _type = QUOTED_LITERAL;
// sources/org/antlr/gunit/g.g:5:17: Open ( Slash Close | ~ Close )* Close
mOpen();
// sources/org/antlr/gunit/g.g:5:22: ( Slash Close | ~ Close )*
loop1: do {
int alt1=3;
int LA1_0 = input.LA(1);

if ( (LA1_0=='>') ) {
int LA1_1 = input.LA(2);
if ( (LA1_1=='>') ) {
(**) int LA1_4 = input.LA(3);

if ( ((LA1_4>='\u0000' && LA1_4<='\uFFFE')) ) {
alt1=2;
}
}
else if ( ((LA1_1>='\u0000' && LA1_1<='=')||(LA1_1>='?' && 
LA1_1<='\uFFFE')) ) {
alt1=2;
}
}
else if ( (LA1_0=='\\') ) {
</code>

Notice the LA1_4 check? The code is working fine (since the default for 
alt1=3 means "end of string reached") until the check if LA1_4 is ANY 
character, whereupon the thing keeps eating characters.

The problem here is that the lexer is being greedy. After seeing several 
places where .* and .+ occur, you might (and I definitely did - and lost 
two hours because of it) think that the lexer is non-greedy by default. 
Not so. As Terence says in his book,

"Unfortunately, following the usual convention that all subrules are 
greedy makes this notation useless. Such greedy subrules would match all 
characters until the end of file. ANTLR automatically makes these two 
subrules nongreedy. So, you can use ’.*’ instead of manually specifying 
the option."

Those TWO SUBRULES are automatically non-greedy. The lexer by default 
still wants to gobble up everything in sight. Which means that for my 
thing to work, I need to make my ()* loop non-greedy:

<code>

QUOTED_LITERAL: Open (options {greedy=false;}: Slash Close| ~Close)* Close
;

</code>

This gives me much better code:

<code>

mOpen();
// sources/org/antlr/gunit/g.g:5:22: ( options {greedy=false; } : Slash 
Close | ~ Close )*
loop1: do {
int alt1=3;
int LA1_0 = input.LA(1);

if ( (LA1_0=='>') ) {
int LA1_1 = input.LA(2);
if ( (LA1_1=='>') ) {
alt1=3;
}
else if ( ((LA1_1>='\u0000' && LA1_1<='=')||(LA1_1>='?' && 
LA1_1<='\uFFFE')) ) {
alt1=2;
}
}
else if ( (LA1_0=='\\') ) {
alt1=1;
}
</code>

This is important for C-style /* comments */, since they use multi-byte 
delimiters. Presumably also for Pascal (* comments *) as well.

Here's the same rule, with the Open and Close delimiters set to 
double-quote("). Still with greedy=false, and the structure of the code 
is similar, although recognizing the delimiter is much simpler:

<code>
mOpen();
// sources/org/antlr/gunit/g.g:5:22: ( options {greedy=false; } : Slash 
Close | ~ Close )*
loop1:
do {
int alt1=3;
int LA1_0 = input.LA(1);

if ( (LA1_0=='\"') ) {
alt1=3;
}
else if ( (LA1_0=='\\') ) {
alt1=1;
}
else if ( ((LA1_0>='\u0000' && LA1_0<='!')||(LA1_0>='#' && 
LA1_0<='[')||(LA1_0>=']' && LA1_0<='\uFFFE')) ) {
alt1=2;
}
</code>

And with greedy=false turned off:

<code>
mOpen();
// sources/org/antlr/gunit/g.g:5:22: ( Slash Close | ~ Close )*
loop1:
do {
int alt1=3;
int LA1_0 = input.LA(1);

if ( (LA1_0=='\\') ) {
int LA1_2 = input.LA(2);
if ( (LA1_2=='\"') ) {
int LA1_4 = input.LA(3);

if ( ((LA1_4>='\u0000' && LA1_4<='\uFFFE')) ) {
alt1=1;
}
else {
alt1=2;
}
}
else if ( ((LA1_0>='\u0000' && LA1_0<='!')||(LA1_0>='#' && 
LA1_0<='[')||(LA1_0>=']' && LA1_0<='\uFFFE')) ) {
alt1=2;
}
</code>

In this case (no "greedy=false" specified, single-character delimiters) 
the code generated is very different. Notice that the checks are in a 
different order, and the default is to be non-greedy (a closing 
double-quote doesn't execute ANY of the code shown, carrying alt1=3 down 
through). I don't know if this is a special-case to eliminate 
complaining newbies (like me) or if some transformation converts this to 
.* under the hood or what. But it seems like there is some clever DWIM 
code built in to ANTLR for the general case of recognizing strings with 
single-character delimiters.

Now you know -- and knowing is half the battle!

=Austin

[antlr-interest] Greedy Lexer FYI: How to specify a string (or block comment) with multi-character delimiters (<<string>> or /*comment*/)

[antlr-interest] Greedy Lexer FYI: How to specify a string (or block comment) with multi-character delimiters (<<string>> or /comment/)