[antlr-interest] Lexer Code and 65535 bytes limit

Andreas Bartho andreas.bartho at inf.tu-dresden.de
Wed Sep 5 08:19:09 PDT 2007


Hi,

when creating a Lexer I noticed that the generated method mTokens() can 
be quite large, depending on how many tokens are specified. This leads 
to a "code too large" compile error, because the JVM limits the bytecode 
of a single method to 65535 bytes. Given the following grammar:

lexer grammar StupidHack;

@lexer::header {
package csharp.parser;
}


FALSE : 'false' ;
TRUE : 'true' ;
DEFINE : 'define' ;
UNDEF : 'undef' ;
IF : 'if' ;
ELIF : 'elif' ;
ELSE : 'else' ;
ENDIF : 'endif' ;
ERROR : 'error' ;
WARNING : 'warning' ;
REGION : 'region' ;
ENDREGION : 'endregion' ;
PRAGMA : 'pragma' ;
LINE : 'line' ;
LOGICALOR : '||' ;
LOGICALAND : '&&' ;
EXCLAM : '!' ;
EQUALS : '==' ;
NOTEQUALS : '!=' ;
LPAREN  : '(' ;
RPAREN  : ')' ;
NUMBER : '#' ;

Identifier
     :   Lettercharacter+
     ;

fragment
Lettercharacter
     :   '\u0041'..'\u005a'
     |   '\u0061'..'\u007a'
     |   '\u00aa'
     |   '\u00b5'

	/* snip */

     |   '\uffc2'..'\uffc7'
     |   '\uffca'..'\uffcf'
     |   '\uffd2'..'\uffd7'
     |   '\uffda'..'\uffdc'
     ;


something like the following is created:

     public void mTokens() throws RecognitionException {
         int alt2=23;
         switch ( input.LA(1) ) {
         case 'f':
             {
             int LA2_1 = input.LA(2);

             if ( (LA2_1=='a') ) {
                 int LA2_19 = input.LA(3);

                 if ( (LA2_19=='l') ) {
                     int LA2_33 = input.LA(4);

                     if ( (LA2_33=='s') ) {
                         int LA2_46 = input.LA(5);

                         if ( (LA2_46=='e') ) {
                             int LA2_59 = input.LA(6);

                             if ( ((LA2_59>='A' && LA2_59<='Z')
                                   ||(LA2_59>='a' && LA2_59<='z')
                                   ||LA2_59=='\u00AA'||LA2_59=='\u00B5'
                                   ||(LA2_59>='\uFFC2' && LA2_59<='\uFFC7')
                                   ||(LA2_59>='\uFFCA' && LA2_59<='\uFFCF')
                                   ||(LA2_59>='\uFFD2' && LA2_59<='\uFFD7')
                                   ||(LA2_59>='\uFFDA' && LA2_59<='\uFFDC')) ) {
                                 alt2=23;
                             }
                             else {
                                 alt2=1;}
                         }
                         else {
                             alt2=23;}
                     }
                     else {
                         alt2=23;}
                 }
                 else {
                     alt2=23;}
             }
             else {
                 alt2=23;}
             }
             break;
         case 't':
             {
             int LA2_2 = input.LA(2);

             if ( (LA2_2=='r') ) {
                 int LA2_20 = input.LA(3);
              ....

This nesting grows very large very quickly.
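For reference, the method-size ceiling itself is easy to reproduce without ANTLR. The following is a minimal sketch (the class and method names are my own invention, and it assumes a JDK with the javax.tools compiler available): it generates a Java source file whose single method exceeds the bytecode limit and compiles it programmatically; javac rejects it with "code too large":

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class CodeTooLarge {
    // Generate a class whose single method body exceeds 65535 bytes of
    // bytecode, then try to compile it. Returns javac's exit code.
    static int compileHugeMethod() throws IOException {
        StringBuilder src =
            new StringBuilder("class Huge { static int f() { int x = 0;\n");
        // each "x++;" compiles to a 3-byte iinc instruction, so 30000 of
        // them produce roughly 90000 bytes of bytecode -- well over the limit
        for (int i = 0; i < 30000; i++) {
            src.append("x++;\n");
        }
        src.append("return x; } }\n");

        Path file = Files.createTempDirectory("huge").resolve("Huge.java");
        Files.writeString(file, src);

        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        // javac prints "error: code too large" and returns a non-zero code
        return javac.run(null, null, null, file.toString());
    }

    public static void main(String[] args) throws IOException {
        System.out.println("javac exit code: " + compileHugeMethod());
    }
}
```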
If I add some kind of dummy rule at the end of the grammar, the 
generated code is completely different and the size problem does not occur.

New rule:

STUPID_HACK
     :   'a'+ '.'
     ;

Generated code:

    public void mTokens() throws RecognitionException {
         int alt3=24;
         alt3 = dfa3.predict(input);
         switch (alt3) {
             case 1 :
                 // StupidHack.g:1:10: FALSE
                 {
                 mFALSE();

                 }
                 break;
             case 2 :
                 // StupidHack.g:1:16: TRUE
                 {
                 mTRUE();

                 }
                 break;
             case 3 :
                 // StupidHack.g:1:21: DEFINE
                 {
                 mDEFINE();

                 }
                 break;
             case 4 :
                 // StupidHack.g:1:28: UNDEF
                 {
                 mUNDEF();

                 }
                 break;
                ...
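If I understand the second form correctly, dfa3.predict() avoids the limit because it is a small interpreter loop over static transition tables: only the data grows with the number of tokens, while the method body stays constant-size. A toy sketch of that idea (my own simplified code, not ANTLR's actual DFA classes), distinguishing just 'if' and 'in':

```java
import java.util.Arrays;

public class TinyDfa {
    static final int NUM_STATES = 4;
    // transition[state][letter - 'a'] = next state, -1 = dead end
    static final int[][] transition = new int[NUM_STATES][26];
    // accept[state] = predicted alternative, 0 = no match
    static final int[] accept = {0, 0, 1, 2};

    static {
        for (int[] row : transition) Arrays.fill(row, -1);
        transition[0]['i' - 'a'] = 1; // start --'i'--> state 1
        transition[1]['f' - 'a'] = 2; // "if" -> state 2, alternative 1
        transition[1]['n' - 'a'] = 3; // "in" -> state 3, alternative 2
    }

    // The prediction method stays the same size no matter how many
    // alternatives the tables encode -- only the static data grows.
    static int predict(String input) {
        int state = 0;
        for (int i = 0; i < input.length() && state >= 0; i++) {
            char c = input.charAt(i);
            state = (c >= 'a' && c <= 'z') ? transition[state][c - 'a'] : -1;
        }
        return state >= 0 ? accept[state] : 0;
    }

    public static void main(String[] args) {
        System.out.println(predict("if")); // alternative 1
        System.out.println(predict("in")); // alternative 2
        System.out.println(predict("ix")); // no match
    }
}
```

If I read the generated code correctly, ANTLR additionally stores its tables as compressed strings that are unpacked in a static initializer, which keeps even very large DFAs within the class-file limits.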

What is the reason for the different behaviour, and would it be possible 
to suppress the first kind of generated code (the inlined switch) 
completely in favour of the DFA-based version?

Andreas

