[antlr-interest] Huge tables in C lexers

Mike Lischke mike at lischke-online.de
Sun Nov 11 02:17:03 PST 2012


Hey Jim,

> You misunderstand, Mike. The situation is pretty much the exact opposite
> of what you suggest. Java has to do what it does because there is no way
> to declare character arrays with pre-initialized data. C/C++ is streets
> ahead in this regard.

I guess you misunderstood me. It's not that I'm judging one approach over the other with regard to execution. My concern is rather the big generated source file, which takes a long time to load into a debugger. This is even worse with an IDE like Xcode, which has a hard time coping with files that large. Other tools like SCMs also reach their limits when you try to view a diff of such huge files, do some pre-commit processing and what not. Runtime behavior is not the only aspect ...

> While these days the tables are not a big deal, you should still limit
> them by making your lexer leaner. You are probably telling the lexer to
> pick out certain classes of characters (from a spec?).

Only this:

// As defined in http://dev.mysql.com/doc/refman/5.6/en/identifiers.html.
fragment LETTER_WHEN_UNQUOTED:
	'0'..'9'
	| 'A'..'Z' // Only upper case, as we use a case-insensitive parser (insensitive only for ASCII).
	| '$'
	| '_'
	| '\u0080'..'\uffff'
;

When I comment out the last alt, I get a much smaller lexer source file.

> It is better to
> relax lexer rules and get them to recognize the roughly correct pattern,
> then check the character set and issue a semantic error. If you get a
> lexer error, all you can do is say "invalid character" and it pretty much
> ends the sequence. Move your errors as far into the tool chain as you
> can, because you will give your users much better context in the errors.
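
If I understand the pattern right, it would look roughly like this (just a sketch with a hypothetical IDENTIFIER rule and a placeholder action, not what my grammar actually does):

// Relax-then-check sketch (ANTLR 3): match any run of characters that
// roughly looks like an identifier, then validate the exact character
// set in the action, where line/column and the token text are available
// for a much better error message.
IDENTIFIER
	:	(~(' ' | '\t' | '\r' | '\n'))+
		{
			// Placeholder: scan $text and emit a semantic error such as
			// "character X is not allowed in an identifier" instead of
			// letting the lexer die with a bare "invalid character".
		}
	;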


You have said this many times already, and I try to follow your advice as much as I can (because it makes a lot of sense), but in a case like this it's totally OK to stop processing immediately and say: "nonsense char found, fix it". Why run a whole chain of processing if it is just a wrong character that the user entered?
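
For the fail-fast case a final catch-all rule should be enough, I'd think (again just a sketch with a placeholder action):

// Sketch: final lexer rule that catches any character no other rule
// matched and reports it right away.
INVALID_CHAR
	:	.
		{
			// Placeholder: report "nonsense char found, fix it" with
			// line/column and abort or skip the character.
		}
	;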

To test the outcome I changed the LETTER_WHEN_UNQUOTED rule above to just:

fragment LETTER_WHEN_UNQUOTED:
	.
;

which not only got me warnings about 3 other rules that can no longer be matched (even though this rule is essentially the last one in my grammar, followed only by 2 virtual tokens), but also increased the lexer source size to 55 MB (from 38 MB)! So relaxing the lexer does not seem to be a solution to this problem. The only way to get the size down is to exclude certain input.
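
Unless, maybe, a gated semantic predicate could keep the range check out of the DFA tables. Completely untested sketch, and whether LA(1) works like this in a lexer predicate with the C target is an assumption on my part:

fragment LETTER_WHEN_UNQUOTED:
	'0'..'9'
	| 'A'..'Z'
	| '$'
	| '_'
	| { LA(1) >= 0x80 }?=> . // range check in a predicate instead of the tables
;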

Mike
-- 
www.soft-gems.net



