[antlr-interest] Lexer Code and 65535 bytes limit
Andreas Bartho
andreas.bartho at inf.tu-dresden.de
Wed Sep 5 08:19:09 PDT 2007
Hi,
When creating a lexer I noticed that the generated method mTokens() can
become quite large, depending on how many tokens are specified. This
results in a compile error, because the bytecode of a single Java method
cannot exceed 65535 bytes. Given the following grammar:
lexer grammar StupidHack;
@lexer::header {
package csharp.parser;
}
FALSE : 'false' ;
TRUE : 'true' ;
DEFINE : 'define' ;
UNDEF : 'undef' ;
IF : 'if' ;
ELIF : 'elif' ;
ELSE : 'else' ;
ENDIF : 'endif' ;
ERROR : 'error' ;
WARNING : 'warning' ;
REGION : 'region' ;
ENDREGION : 'endregion' ;
PRAGMA : 'pragma' ;
LINE : 'line' ;
LOGICALOR : '||' ;
LOGICALAND : '&&' ;
EXCLAM : '!' ;
EQUALS : '==' ;
NOTEQUALS : '!=' ;
LPAREN : '(' ;
RPAREN : ')' ;
NUMBER : '#' ;
Identifier
: Lettercharacter+
;
fragment
Lettercharacter
: '\u0041'..'\u005a'
| '\u0061'..'\u007a'
| '\u00aa'
| '\u00b5'
/* snip */
| '\uffc2'..'\uffc7'
| '\uffca'..'\uffcf'
| '\uffd2'..'\uffd7'
| '\uffda'..'\uffdc'
;
Something like the following is generated:
public void mTokens() throws RecognitionException {
    int alt2=23;
    switch ( input.LA(1) ) {
    case 'f':
        {
        int LA2_1 = input.LA(2);
        if ( (LA2_1=='a') ) {
            int LA2_19 = input.LA(3);
            if ( (LA2_19=='l') ) {
                int LA2_33 = input.LA(4);
                if ( (LA2_33=='s') ) {
                    int LA2_46 = input.LA(5);
                    if ( (LA2_46=='e') ) {
                        int LA2_59 = input.LA(6);
                        if ( ((LA2_59>='A' && LA2_59<='Z')||(LA2_59>='a' && LA2_59<='z')||LA2_59=='\u00AA'||LA2_59=='\u00B5'||(LA2_59>='\uFFC2' && LA2_59<='\uFFC7')||(LA2_59>='\uFFCA' && LA2_59<='\uFFCF')||(LA2_59>='\uFFD2' && LA2_59<='\uFFD7')||(LA2_59>='\uFFDA' && LA2_59<='\uFFDC')) ) {
                            alt2=23;
                        }
                        else {
                            alt2=1;}
                    }
                    else {
                        alt2=23;}
                }
                else {
                    alt2=23;}
            }
            else {
                alt2=23;}
        }
        else {
            alt2=23;}
        }
        break;
    case 't':
        {
        int LA2_2 = input.LA(2);
        if ( (LA2_2=='r') ) {
            int LA2_20 = input.LA(3);
            ....
This grows very large very quickly.
If I add some kind of dummy rule at the end of the grammar, the
generated code is completely different and the size problem does not occur.
New rule:
STUPID_HACK
: 'a'+ '.'
;
Generated code:
public void mTokens() throws RecognitionException {
    int alt3=24;
    alt3 = dfa3.predict(input);
    switch (alt3) {
        case 1 :
            // StupidHack.g:1:10: FALSE
            {
            mFALSE();
            }
            break;
        case 2 :
            // StupidHack.g:1:16: TRUE
            {
            mTRUE();
            }
            break;
        case 3 :
            // StupidHack.g:1:21: DEFINE
            {
            mDEFINE();
            }
            break;
        case 4 :
            // StupidHack.g:1:28: UNDEF
            {
            mUNDEF();
            }
            break;
        ...
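To illustrate why the second form stays small, here is a minimal sketch of table-driven prediction. It is not ANTLR's actual DFA class: the class name TableDrivenPredictor, the state numbering, and the two-alternative toy language (the keyword 'if' versus an ASCII-only identifier) are all assumptions made up for this example. The point is only that the decision logic lives in data tables, so predict() keeps the same size no matter how many keywords are added.

```java
// Hypothetical sketch, not ANTLR's real DFA implementation.
public final class TableDrivenPredictor {
    // Alternatives the prediction can select (numbering is made up for this example).
    public static final int NO_ALT = 0;
    public static final int ALT_IF = 1;
    public static final int ALT_IDENTIFIER = 2;

    // Character classes: 0 = 'i', 1 = 'f', 2 = any other ASCII letter, 3 = everything else.
    static int charClass(char c) {
        if (c == 'i') return 0;
        if (c == 'f') return 1;
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return 2;
        return 3;
    }

    // TRANSITION[state][charClass] is the next state, or -1 to stop scanning.
    // States: 0 = start, 1 = saw "i", 2 = saw "if", 3 = inside an identifier.
    static final int[][] TRANSITION = {
        { 1, 3, 3, -1 },
        { 3, 2, 3, -1 },
        { 3, 3, 3, -1 },
        { 3, 3, 3, -1 },
    };

    // ACCEPT[state] is the alternative predicted when scanning stops in that state.
    static final int[] ACCEPT = { NO_ALT, ALT_IDENTIFIER, ALT_IF, ALT_IDENTIFIER };

    // Walks the tables over the lookahead and returns the predicted alternative.
    public static int predict(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            int next = TRANSITION[state][charClass(input.charAt(i))];
            if (next < 0) break;   // no transition: the decision is made
            state = next;
        }
        return ACCEPT[state];
    }

    public static void main(String[] args) {
        System.out.println(predict("if"));   // 1 (keyword)
        System.out.println(predict("ifx"));  // 2 (identifier)
        System.out.println(predict("("));    // 0 (no viable alternative)
    }
}
```

Adding a keyword in this style means adding rows to the tables rather than another level of nested if statements. Note that large array initializers are themselves compiled into the class's static initializer, which is subject to the same size limit, so a real generator would still need to keep its tables compact.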
What is the reason for the different behaviour, and would it be possible
to drop the first kind of generated code entirely, so that the DFA-based
form is always used?
Andreas