[antlr-interest] Lexer generated for C# more than 100 times larger than for Java

Tue Jul 24 13:35:06 PDT 2007

Farr, John wrote:
> For a grammar I've been working on, the lexer file generated for
> "language=CSharp" is over 100 times as large as that generated when
> "language=Java". The C# lexer file is 26 MB (26,019,532 bytes), whereas
> the Java lexer file is 240 KB (240,282 bytes). Obviously such a huge
> file taxes the C# compiler (amazingly, it does build, but ever so slowly).
> 
> The size difference seems to be, at least in part, in the way the "DFA
> transition" tables are generated. For Java these tables are generated as
> Strings; for C# they're generated as arrays of shorts. There may be
> other contributors to the size difference as well.
> 
> It seems peculiar that there would be such a huge difference in
> generated source code size for 2 targets that are so similar. Is there
> any possibility of reducing the size of the generated C# lexer code?
> 
> Thanks,
> John
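
To illustrate why the table representation mentioned in the quoted message
can matter so much for source size, here is a minimal Java sketch. It
assumes the string form is a run-length encoding of (count, value) pairs;
it is not a copy of what ANTLR emits, and the class and method names are
made up for the example.

```java
import java.util.Arrays;

// Hypothetical sketch: a DFA transition row stored as a run-length encoded
// String instead of one source literal per table entry.
final class PackedTableSketch {

    // Encode the table as (run length, value) char pairs.
    static String pack(short[] table) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < table.length) {
            short value = table[i];
            int run = 1;
            // Extend the run while the value repeats (a run must fit in one char).
            while (i + run < table.length && table[i + run] == value && run < 0xFFFF) {
                run++;
            }
            sb.append((char) run).append((char) value);
            i += run;
        }
        return sb.toString();
    }

    // Decode the (run length, value) pairs back into a short array.
    static short[] unpack(String packed) {
        int size = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            size += packed.charAt(i);
        }
        short[] table = new short[size];
        int out = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            int run = packed.charAt(i);
            short value = (short) packed.charAt(i + 1);
            Arrays.fill(table, out, out + run, value);
            out += run;
        }
        return table;
    }

    public static void main(String[] args) {
        // A typical sparse row: "no transition" (-1) everywhere except for 'a'.
        short[] row = new short[256];
        Arrays.fill(row, (short) -1);
        row['a'] = 5;

        String packed = pack(row);
        System.out.println("entries as array literal: " + row.length);
        System.out.println("chars in packed string:   " + packed.length());
        System.out.println("round trip ok: " + Arrays.equals(row, unpack(packed)));
    }
}
```

A row that is mostly "no transition" collapses to a handful of characters
when packed, whereas an array initializer has to spell out every entry in
the generated source. Whether this accounts for the whole difference in
your case I can't say.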

Try restructuring the lexer rules slightly. For whatever reason, certain
constructs lead to very different generated code, even though the
recognized language stays the same. I discovered this by changing

WHITESPACE
	:	WHITESPACE_CHARACTERS
	;

fragment WHITESPACE_CHARACTERS
	:	WHITESPACE_CHARACTER+
	;

to

WHITESPACE
	:	WHITESPACE_CHARACTER+
	;

Unfortunately, I don't know how to identify the culprit other than by
removing rules temporarily and checking the generated lexer again.
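
If you want to make that trial and error a bit more systematic, the
bookkeeping could look roughly like the Java sketch below. It only shows
the leave-one-out comparison. The generatedSize function is a hypothetical
stand-in that, in practice, would have to write out a grammar containing
just the given rules, run the ANTLR tool on it, and return the size of the
generated lexer file. The numbers in main are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch: measure how much each lexer rule contributes to the
// generated file size by leaving it out and regenerating.
final class RuleCulpritSketch {

    // Returns, per rule, how many bytes the output shrinks without that rule.
    static Map<String, Long> savingsPerRule(List<String> rules,
                                            ToLongFunction<List<String>> generatedSize) {
        long baseline = generatedSize.applyAsLong(rules);
        Map<String, Long> savings = new LinkedHashMap<>();
        for (String rule : rules) {
            List<String> without = new ArrayList<>(rules);
            without.remove(rule);
            savings.put(rule, baseline - generatedSize.applyAsLong(without));
        }
        return savings;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList("IDENTIFIER", "WHITESPACE", "NUMBER");

        // Stand-in for "generate the lexer and measure the file": every rule
        // costs 1,000 bytes, and WHITESPACE is pretended to cost far more.
        ToLongFunction<List<String>> fakeSize = rs ->
                rs.size() * 1_000L + (rs.contains("WHITESPACE") ? 25_000_000L : 0L);

        savingsPerRule(rules, fakeSize)
                .forEach((rule, saved) -> System.out.println(rule + ": " + saved));
    }
}
```

Keep in mind that a rule referenced by other rules can't simply be dropped;
its references would have to be inlined or stubbed out first, so this only
helps for rules that can be removed cleanly.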

Best regards,
Johannes Luber