[antlr-interest] Size of generated lexer code

David Goldberg david at gohazel.com
Wed Jul 2 11:19:00 PDT 2008


From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Jim Idle

This usually happens because you either specify alts in a rule in a way that causes huge tables, or you are trying to make the lexer scan for all sorts of character ranges for naming compliance in variables say. When the initial set is the entire unicode range, then of course you end up with huge tables and rulesets.

It is generally a lot easier to accept any old characters for something liek a variable, then validate them. This way you will also generate a semantic message such as "Variable xxxxxy cannot use the character y in its name." instead of the lexer producing errors or a token sequence you are not expecting. Whether you cna do this depends on what your token set is of course. Why don't you post your lexer?

You might also try 3.1beta and see if it makes a difference.



Your suggestion about name compliance in variables seems to have done it. I commented out the Lexer rules below and the generated file shrunk from 7000k to 1500k. I assume if I make the middle part less complicated I could probably save a lot of space and then manually check as you suggest.

//IDENTIFIER
//            :               LETTER ((LETTER | IDDIGIT | '-')* (LETTER | IDDIGIT))? ('?' | '!' | '%' | '@' | '^' )?
//            ;

/*
fragment
LETTER :               '\u0024' |
       '\u0041'..'\u005a' |
       '\u005f' |
       '\u0061'..'\u007a' |
       '\u00c0'..'\u00d6' |
       '\u00d8'..'\u00f6' |
       '\u00f8'..'\u00ff' |
       '\u0100'..'\u1fff' |
       '\u3040'..'\u318f' |
       '\u3300'..'\u337f' |
       '\u3400'..'\u3d2d' |
       '\u4e00'..'\u9fff' |
       '\uf900'..'\ufaff'
    ;
fragment
IDDIGIT
    :  '\u0030'..'\u0039' |
       '\u0660'..'\u0669' |
       '\u06f0'..'\u06f9' |
       '\u0966'..'\u096f' |
       '\u09e6'..'\u09ef' |
       '\u0a66'..'\u0a6f' |
       '\u0ae6'..'\u0aef' |
       '\u0b66'..'\u0b6f' |
       '\u0be7'..'\u0bef' |
       '\u0c66'..'\u0c6f' |
       '\u0ce6'..'\u0cef' |
       '\u0d66'..'\u0d6f' |
       '\u0e50'..'\u0e59' |
       '\u0ed0'..'\u0ed9' |
       '\u1040'..'\u1049'
   ;
*/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20080702/981fa3ea/attachment-0001.html 


More information about the antlr-interest mailing list