[antlr-interest] Fun with ANTLR3: mystery of the huge lexer (C#)
Jim Idle
jimi at temporal-wave.com
Sat Jun 30 09:47:15 PDT 2007
David,
Your ML_COMMENT needs to delegate to a fragment rule, and you need a
predicate to stop '.' from interfering with the embedded comment match.
I produced just this rule for my T-SQL lexer, in fact (the predicate
shown is C, but it is simply input.LA(n) for the Java target):
// A multiline comment is akin to a C-style comment and is bounded
// by /* and */. However, the T-SQL lexer allows for, and checks,
// embedded comments. See how here we use a fragment rule to define
// the lexical construct, as it does not try to create tokens and
// hence can be called recursively by itself. The actual token-making
// rule then just calls that fragment rule.
//
ML_COMMENT
    : ML_COMFRAG
      {
          $channel = HIDDEN;
      }
    ;

// This rule is a fragment so that it can call itself recursively
// and deal with multiple embedded comments.
//
fragment ML_COMFRAG
    : '/*'
      ( options { greedy=false; }

          // The predicate looks for the start of an embedded comment
          // and this triggers a recursive call of this rule
          // and therefore automatically matches /* and */ pairs.
          //
        : {(LA(1) == '/' && LA(2) == '*')}? ML_COMFRAG
        | .
      )* '*/'
    ;
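As an aside on what greedy=false buys you in that subrule: it makes the loop test its exit sequence '*/' before trying another iteration, much like a non-greedy regex quantifier. A rough Python/re analogy (my own illustration, not ANTLR output):

```python
import re

text = "/* first */ code /* second */"

# Greedy: '.*' runs to the end of the input and backtracks
# only as far as the LAST '*/'.
greedy = re.match(r"/\*.*\*/", text).group()

# Non-greedy: '.*?' stops at the FIRST '*/', which is the
# behaviour greedy=false gives the (...)* loop in the lexer rule.
non_greedy = re.match(r"/\*.*?\*/", text).group()
```

With greedy matching the whole line is swallowed as one "comment"; non-greedy matching stops at the first terminator.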
That should help with that part. Then: is your PUNC rule something that
returns a token, or are you using it somewhere else too?
Check your lexer rules, and if you use one rule inside another, make
sure the inner rule is a fragment. My guess is that this will cure most
of your issues here.
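The logic the fragment rule implements can be sketched outside ANTLR as a small hand-written scanner. This Python sketch (my own names, not generated code) mirrors ML_COMFRAG: a two-character lookahead for '/*' triggers a recursive call, otherwise one character is consumed, and the non-greedy loop exits at the first '*/':

```python
def match_ml_comment(text, pos=0):
    """Match a (possibly nested) /* ... */ comment starting at pos.

    Returns the index just past the closing '*/', or raises
    ValueError if no well-formed comment starts at pos.
    """
    if text[pos:pos + 2] != '/*':
        raise ValueError("no comment at position %d" % pos)
    pos += 2
    while pos < len(text):
        if text[pos:pos + 2] == '*/':   # non-greedy exit: first '*/' wins
            return pos + 2
        if text[pos:pos + 2] == '/*':   # the predicate: embedded comment,
            pos = match_ml_comment(text, pos)  # so recurse past it
        else:
            pos += 1                    # the '.' alternative: any one char
    raise ValueError("unterminated comment")
```

Because the recursive call consumes a whole embedded comment before the outer loop looks for '*/' again, nested pairs balance automatically.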
Jim
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of David Piepgrass
> Sent: Saturday, June 30, 2007 9:16 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Fun with ANTLR3: mystery of the huge lexer
> (C#)
>
> When I attempted to generate my "first try" at making a Lexer, ANTLR
> produced a 21.9 megabyte source file (310,643 lines) filled mostly
> with variables like dfa32_transition15; each transition array for the
> most part contains the same number repeated thousands of times.
>
> I was curious to know if the grammar actually worked, but after 15
> minutes Visual C# still had not finished compiling it, though it was
> using 1.6 GB of virtual memory. I killed it since the whole computer
> was unresponsive (a longstanding flaw in Windows).
>
> The problem seems related to these two rules, because without them, no
> transition arrays are created at all:
>
> ML_COMMENT: '/*' (ML_COMMENT | .)* '*/' { $channel = HIDDEN; };
>
> PUNC: ( ('\\' WS_CHAR)=>'\\'
> | (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'|'*'
> |'-'|'+'|'='|'|'|','|'<'|'>'|'/'|'?')
> )+;
>
> If PUNC is in the grammar but not ML_COMMENT, a 2.38 MB file comes
> out. If ML_COMMENT is present but not PUNC, an 8.16 MB file comes
> out. I get a strange warning for ML_COMMENT (when PUNC is absent; if
> PUNC is present, it gets even stranger):
>
> [08:35:47] warning(206): Huge.g:1:10: Alternative 4: after matching
> input such as '/''*''/''*''/''*''/''*''/'<EOT> decision cannot predict
> what comes next due to recursion overflow to ML_COMMENT from
> ML_COMMENT
>
> Puzzling, yes? But with some fiddling, another warning suggests that
> the problem has something to do with backslashes:
>
> warning(206): Huge.g:1:10: Alternative 4: after matching input such as
> '/''*''*''\\''/''*''*''\\''/''*''*''\\''/''*''*''\\''*'{'\u0000'..'\b',
> '\u000B'..'\f', '\u000E'..'\u001F', '!'..')', '+'..'.', '0'..'[',
> ']'..'\uFFFE'}'/''\u0000'..'\uFFFE' decision cannot predict
> what comes next due to recursion overflow to ML_COMMENT from
> ML_COMMENT
>
> I realize it comes from my regular expression rule:
>
> RE_STRING: {input.LA(2) != '*'}? '/' (RE_CHAR)+ '/'; // | '@/' (RE_CHAR | WS_CHAR) '/';
>
> RE_STRING: '/' (RE_CHAR)+ '/';
> fragment RE_CHAR: RE_ESC | ~('/' | '\\' | '\r' | '\n' | ' ' | '\t' );
> fragment RE_ESC: '\\' .;
>
> So there's an ambiguity for something like /*\*/: is it a comment or a
> regex? So I try
>
> RE_STRING: {input.LA(2) != '*'}? '/' (RE_CHAR)+ '/';
>
> 19 MB.
>
> ML_COMMENT: ('/*') => '/*' (ML_COMMENT | .)* '*/' { $channel = HIDDEN;
> };
>
> 8.16 MB.
>
> In fact, this is all it takes to get a 19 MB output file:
>
>
> --------------------------------------------------------------------------
> lexer grammar Huge;
> options
> {
> language = 'CSharp';
> k = *;
> }
> fragment WS_CHAR: ' ' | '\t';
> ML_COMMENT: ('/*')=> '/*' (ML_COMMENT | .)* '*/';
> RE_STRING: '/' (RE_CHAR)+ '/';
> fragment RE_CHAR: RE_ESC | ~('/' | '\\' | '\r' | '\n' | ' ' | '\t' );
> fragment RE_ESC: '\\' .;
> PUNC: ( ('\\' WS_CHAR)=>'\\'
> | (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'|'*'
> |'-'|'+'|'='|'|'|','|'<'|'>'|'/'|'?')
> )+;
>
> --------------------------------------------------------------------------
>
> Remove PUNC and you still get 6.3 MB and this warning:
>
> [09:09:14] warning(206): Huge.g:1:10: Alternative 1: after matching
> input such as
>
> '/''*''*''\\''/''*''*''\\''/''*''*''\\''/''*''*''\\''*'{'\u0000'..'\b',
> '\u000B'..'\f', '\u000E'..'\u001F', '!'..')', '+'..'.', '0'..'[',
> ']'..'\uFFFE'}'/''\u0000'..'\uFFFE' decision cannot predict what comes
> next due to recursion overflow to ML_COMMENT from ML_COMMENT
>
> The strange thing is, the Java target doesn't have a problem. This
> Lexer is "only" 30 KB in Java. I thought StringTemplate was just a
> somewhat superficial conversion process - why the big difference?
>
> And why doesn't the syntactic predicate help?
>
> There is no problem if ML_COMMENT is the only rule in the file, but in
> that case the rule doesn't work: apparently it never recognizes the
> closing "*/" as such. Luckily, this solves the problem:
>
> ML_COMMENT: ('/*')=> '/*' (options{greedy=false;} : ML_COMMENT | .)*
> '*/';
>
> However, I am very curious to know how "greedy=false" affects the
> subrule "ML_COMMENT"; ANTLR would not let me use simply
>
> ML_COMMENT: ('/*')=> '/*' (ML_COMMENT | options{greedy=false;} : .)*
> '*/';
>
> I know this has been long-winded, but I hope someone can give me some
> advice on how to get ML_COMMENT, RE_STRING and PUNC to work together
> nicely.
>
> --
> - David
> http://qism.blogspot.com