[antlr-interest] Fun with ANTLR3: mystery of the huge lexer (C#)

David Piepgrass qwertie256 at gmail.com
Sat Jun 30 09:16:16 PDT 2007


When I attempted to generate my "first try" at making a Lexer, ANTLR
produced a 21.9 megabyte source file (310,643 lines) filled mostly
with variables like dfa32_transition15; each transition array for the
most part contains the same number repeated thousands of times.

I was curious to know if the grammar actually worked, but after 15
minutes Visual C# still had not finished compiling it, though it was
using 1.6 GB of virtual memory. I killed it since the whole computer
was unresponsive (a longstanding flaw in Windows).

The problem seems related to these two rules, because without them, no
transition arrays are created at all:

ML_COMMENT:  '/*' (ML_COMMENT | .)* '*/' { $channel = HIDDEN; };

PUNC: ( ('\\' WS_CHAR)=>'\\'
      | (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'|'*'
        |'-'|'+'|'='|'|'|','|'<'|'>'|'/'|'?')
      )+;

If PUNC is in the grammar but not ML_COMMENT, a 2.38 MB file comes
out. If ML_COMMENT is present but not PUNC, a 8.16 MB file comes out.
I get a strange warning for ML_COMMENT (when PUNC is absent; if PUNC
is present, it gets even stranger):

[08:35:47] warning(206): Huge.g:1:10: Alternative 4: after matching
input such as '/''*''/''*''/''*''/''*''/'<EOT> decision cannot predict
what comes next due to recursion overflow to ML_COMMENT from
ML_COMMENT

Puzzling, yes? But with some fiddling, another warning suggests that
the problem has something to do with backslashes:

warning(206): Huge.g:1:10: Alternative 4: after matching input such as
'/''*''*''\\''/''*''*''\\''/''
*''*''\\''/''*''*''\\''*'{'\u0000'..'\b', '\u000B'..'\f',
'\u000E'..'\u001F', '!'..')', '+'..'.', '0
'..'[', ']'..'\uFFFE'}'/''\u0000'..'\uFFFE' decision cannot predict
what comes next due to recursion overflow to ML_COMMENT from
ML_COMMENT

I realize it comes from my regular expression rule:

RE_STRING: {input.LA(2) != '*'}? '/' (RE_CHAR)+ '/'; //	| '@/'
(RE_CHAR | WS_CHAR) '/';

RE_STRING: '/' (RE_CHAR)+ '/';
fragment RE_CHAR: RE_ESC | ~('/' | '\\' | '\r' | '\n' | ' ' | '\t' );
fragment RE_ESC: '\\' .;

So there's an ambiguity for something like /*\*/: is it a comment or a
regex? So I try

RE_STRING: {input.LA(2) != '*'}? '/' (RE_CHAR)+ '/';

19 MB.

ML_COMMENT: ('/*') => '/*' (ML_COMMENT | .)* '*/' { $channel = HIDDEN; };

8.16 MB.

In fact, this is all it takes to get a 19 MB output file:

--------------------------------------------------------------------------
lexer grammar Huge;
options
{
	language = 'CSharp';
	k = *;
}
fragment WS_CHAR: ' ' | '\t';
ML_COMMENT: ('/*')=> '/*' (ML_COMMENT | .)* '*/';
RE_STRING: '/' (RE_CHAR)+ '/';
fragment RE_CHAR: RE_ESC | ~('/' | '\\' | '\r' | '\n' | ' ' | '\t' );
fragment RE_ESC: '\\' .;
PUNC: ( ('\\' WS_CHAR)=>'\\'
      | (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'|'*'
        |'-'|'+'|'='|'|'|','|'<'|'>'|'/'|'?')
      )+;
--------------------------------------------------------------------------

Remove PUNC and you still get 6.3 MB and this warning:

[09:09:14] warning(206): Huge.g:1:10: Alternative 1: after matching
input such as '/''*''*''\\''/''*''*''\\''/''*''*''\\''/''*''*''\\''*'{'\u0000'..'\b',
'\u000B'..'\f', '\u000E'..'\u001F', '!'..')', '+'..'.', '0'..'[',
']'..'\uFFFE'}'/''\u0000'..'\uFFFE' decision cannot predict what comes
next due to recursion overflow to ML_COMMENT from ML_COMMENT

The strange thing is, the Java target doesn't have a problem. This
Lexer is "only" 30 KB in Java. I thought StringTemplate was just a
somewhat superficial conversion process - why the big difference?

And why doesn't the syntactic predicate help?

There is no problem if ML_COMMENT is the only rule in the file, but in
that case the rule doesn't work: apparently it never recognizes the
closing "*/" as such. Luckily, this solves the problem:

ML_COMMENT: ('/*')=> '/*' (options{greedy=false;} : ML_COMMENT | .)* '*/';

However, I am very curious to know how "greedy=false" affects the
subrule "ML_COMMENT"; ANTLR would not let me use simply

ML_COMMENT: ('/*')=> '/*' (ML_COMMENT | options{greedy=false;} : .)* '*/';

I know this has been long-winded but I hope someone could give me some
advice about how to get ML_COMMENT, RE_STRING and PUNC to work
together nicely.

-- 
- David
http://qism.blogspot.com


More information about the antlr-interest mailing list