[antlr-interest] Fun with ANTLR3: mystery of the huge lexer (C#)
Jim Idle
jimi at temporal-wave.com
Sat Jun 30 09:47:15 PDT 2007
David,
Your ML_COMMENT needs to delegate to a fragment rule, and you need a
predicate to stop '.' from interfering with the embedded comment match.
I produced just this rule for my T-SQL lexer, in fact (the predicate
shown is C, but it is simply input.LA(n) for the Java target):
// A multiline comment is akin to a C-style comment and is bounded
// by /* and */. However, the T-SQL lexer allows for, and checks,
// embedded comments. See how here we use a fragment rule to define
// the lexical construct, as it does not try to create tokens and
// hence can be called recursively by itself. The actual token-making
// rule then just calls that fragment rule.
//
ML_COMMENT
    : ML_COMFRAG
      {
          $channel = HIDDEN;
      }
    ;

// This rule is a fragment so that it can call itself recursively
// and deal with multiple embedded comments.
//
fragment ML_COMFRAG
    : '/*'
      ( options { greedy=false; }

          // The predicate looks for the start of an embedded comment
          // and this triggers a recursive call of this rule
          // and therefore automatically matches /* and */ pairs.
          //
        : {(LA(1) == '/' && LA(2) == '*')}? ML_COMFRAG
        | .
      )* '*/'
    ;
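As an aside on what greedy=false buys you in that subrule: it makes the loop test its exit sequence '*/' before trying another iteration, much like a non-greedy regex quantifier. A rough Python/re analogy (my own illustration, not ANTLR output):

```python
import re

text = "/* first */ code /* second */"

# Greedy: '.*' runs to the end of the input and backtracks
# only as far as the LAST '*/'.
greedy = re.match(r"/\*.*\*/", text).group()

# Non-greedy: '.*?' stops at the FIRST '*/', which is the
# behaviour greedy=false gives the (...)* loop in the lexer rule.
non_greedy = re.match(r"/\*.*?\*/", text).group()
```

With greedy matching the whole line is swallowed as one "comment"; non-greedy matching stops at the first terminator.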
That should help with that part. Then: is your PUNC rule something that
returns a token, or are you using it somewhere else too?
Check your lexer rules, and if you use one rule inside another, make
sure the inner rule is a fragment. My guess is that this will cure most
of your issues here.
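The logic the fragment rule implements can be sketched outside ANTLR as a small hand-written scanner. This Python sketch (my own names, not generated code) mirrors ML_COMFRAG: a two-character lookahead for '/*' triggers a recursive call, otherwise one character is consumed, and the non-greedy loop exits at the first '*/':

```python
def match_ml_comment(text, pos=0):
    """Match a (possibly nested) /* ... */ comment starting at pos.

    Returns the index just past the closing '*/', or raises
    ValueError if no well-formed comment starts at pos.
    """
    if text[pos:pos + 2] != '/*':
        raise ValueError("no comment at position %d" % pos)
    pos += 2
    while pos < len(text):
        if text[pos:pos + 2] == '*/':   # non-greedy exit: first '*/' wins
            return pos + 2
        if text[pos:pos + 2] == '/*':   # the predicate: embedded comment,
            pos = match_ml_comment(text, pos)  # so recurse past it
        else:
            pos += 1                    # the '.' alternative: any one char
    raise ValueError("unterminated comment")
```

Because the recursive call consumes a whole embedded comment before the outer loop looks for '*/' again, nested pairs balance automatically.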
Jim
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of David Piepgrass
> Sent: Saturday, June 30, 2007 9:16 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Fun with ANTLR3: mystery of the huge lexer
> (C#)
>
> When I attempted to generate my "first try" at making a Lexer, ANTLR
> produced a 21.9 megabyte source file (310,643 lines) filled mostly
> with variables like dfa32_transition15; each transition array for the
> most part contains the same number repeated thousands of times.
>
> I was curious to know if the grammar actually worked, but after 15
> minutes Visual C# still had not finished compiling it, though it was
> using 1.6 GB of virtual memory. I killed it since the whole computer
> was unresponsive (a longstanding flaw in Windows).
>
> The problem seems related to these two rules, because without them, no
> transition arrays are created at all:
>
> ML_COMMENT: '/*' (ML_COMMENT | .)* '*/' { $channel = HIDDEN; };
>
> PUNC: ( ('\\' WS_CHAR)=>'\\'
> | (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'|'*'
> |'-'|'+'|'='|'|'|','|'<'|'>'|'/'|'?')
> )+;
>
> If PUNC is in the grammar but not ML_COMMENT, a 2.38 MB file comes
> out. If ML_COMMENT is present but not PUNC, an 8.16 MB file comes
> out. I get a strange warning for ML_COMMENT (when PUNC is absent; if
> PUNC is present, it gets even stranger):
>
> [08:35:47] warning(206): Huge.g:1:10: Alternative 4: after matching
> input such as '/''*''/''*''/''*''/''*''/'<EOT> decision cannot predict
> what comes next due to recursion overflow to ML_COMMENT from
> ML_COMMENT
>
> Puzzling, yes? But with some fiddling, another warning suggests that
> the problem has something to do with backslashes:
>
> warning(206): Huge.g:1:10: Alternative 4: after matching input such as
> '/''*''*''\\''/''*''*''\\''/''*''*''\\''/''*''*''\\''*'{'\u0000'..'\b',
> '\u000B'..'\f', '\u000E'..'\u001F', '!'..')', '+'..'.', '0'..'[',
> ']'..'\uFFFE'}'/''\u0000'..'\uFFFE' decision cannot predict
> what comes next due to recursion overflow to ML_COMMENT from
> ML_COMMENT
>
> I realize it comes from my regular expression rule:
>
> RE_STRING: {input.LA(2) != '*'}? '/' (RE_CHAR)+ '/'; // | '@/' (RE_CHAR | WS_CHAR) '/';
>
> RE_STRING: '/' (RE_CHAR)+ '/';
> fragment RE_CHAR: RE_ESC | ~('/' | '\\' | '\r' | '\n' | ' ' | '\t' );
> fragment RE_ESC: '\\' .;
>
> So there's an ambiguity for something like /*\*/: is it a comment or a
> regex? So I try
>
> RE_STRING: {input.LA(2) != '*'}? '/' (RE_CHAR)+ '/';
>
> 19 MB.
>
> ML_COMMENT: ('/*') => '/*' (ML_COMMENT | .)* '*/' { $channel = HIDDEN;
> };
>
> 8.16 MB.
>
> In fact, this is all it takes to get a 19 MB output file:
>
>
> --------------------------------------------------------------------------
> lexer grammar Huge;
> options
> {
> language = 'CSharp';
> k = *;
> }
> fragment WS_CHAR: ' ' | '\t';
> ML_COMMENT: ('/*')=> '/*' (ML_COMMENT | .)* '*/';
> RE_STRING: '/' (RE_CHAR)+ '/';
> fragment RE_CHAR: RE_ESC | ~('/' | '\\' | '\r' | '\n' | ' ' | '\t' );
> fragment RE_ESC: '\\' .;
> PUNC: ( ('\\' WS_CHAR)=>'\\'
> | (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'|'*'
> |'-'|'+'|'='|'|'|','|'<'|'>'|'/'|'?')
> )+;
>
> --------------------------------------------------------------------------
>
> Remove PUNC and you still get 6.3 MB and this warning:
>
> [09:09:14] warning(206): Huge.g:1:10: Alternative 1: after matching
> input such as
>
> '/''*''*''\\''/''*''*''\\''/''*''*''\\''/''*''*''\\''*'{'\u0000'..'\b',
> '\u000B'..'\f', '\u000E'..'\u001F', '!'..')', '+'..'.', '0'..'[',
> ']'..'\uFFFE'}'/''\u0000'..'\uFFFE' decision cannot predict what comes
> next due to recursion overflow to ML_COMMENT from ML_COMMENT
>
> The strange thing is, the Java target doesn't have a problem. This
> Lexer is "only" 30 KB in Java. I thought StringTemplate was just a
> somewhat superficial conversion process - why the big difference?
>
> And why doesn't the syntactic predicate help?
>
> There is no problem if ML_COMMENT is the only rule in the file, but in
> that case the rule doesn't work: apparently it never recognizes the
> closing "*/" as such. Luckily, this solves the problem:
>
> ML_COMMENT: ('/*')=> '/*' (options{greedy=false;} : ML_COMMENT | .)*
> '*/';
>
> However, I am very curious to know how "greedy=false" affects the
> subrule "ML_COMMENT"; ANTLR would not let me use simply
>
> ML_COMMENT: ('/*')=> '/*' (ML_COMMENT | options{greedy=false;} : .)*
> '*/';
>
> I know this has been long-winded, but I hope someone can give me some
> advice on how to get ML_COMMENT, RE_STRING and PUNC to work together
> nicely.
>
> --
> - David
> http://qism.blogspot.com