[antlr-interest] Using String literals in grammar

Jesse Swidler jswidler at gmail.com
Thu Feb 9 15:17:38 PST 2012

I am trying to write a grammar for ABAP, which is a pretty verbose language.

Pretty much nothing is reserved in ABAP.  You can name a variable whatever
you want.  If your variable is named in such a way that it would make your
syntax ambiguous, you must add an "!" before your variable name to resolve
the problem (although the ABAP style guide recommends against naming your
variables this way, it is not prevented.)  The DATA keyword is used to
declare variables, so the following is an example statement to define a
variable named "DATA"


A greatly simplified grammar would look like:

     'DATA' fieldDefId  DOT;

fieldDefId: anySingleToken;

WS : (' '|'\t')+ {$channel=HIDDEN;};
NL : '\r'? '\n' {$channel=HIDDEN;};
DOT: '.';
INTEGER_LITERAL: '0'..'9'+;
WORD: ~(' '|'\t'|'\r'|'\n'|'.'|':'|','|'('|')'|'<'|'>'|'*'|'-'|'\'')+ ;

// Not really any token; for instance, DOT is not supposed to be accepted.
anySingleToken: INTEGER_LITERAL | WORD ;

My problem here is that ANTLR automatically creates a separate token type
for the 'DATA' literal.  So if you were to try "DATA DATA." - which is most
definitely legal - it does not work, because the second DATA is not returned
as a WORD token like I want it to be.  I would need to add 'DATA' as another
alternative in the anySingleToken production.  There are about 750 words
like DATA that would need to be accounted for and included in the
anySingleToken production if I must create a unique token type for each
keyword-like word in ABAP.  Additionally, when I defined this many
different token types, ANTLR produced a Java file which would not compile
on account of a function containing more than 25,000 lines of code.  So I
am worried that I have two problems here.
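
A minimal sketch of that workaround, assuming the ANTLR 3 combined-grammar
syntax used above.  Only the 'DATA' alternative comes from the grammar
shown; the remaining keyword alternatives are not listed here:

```
// Every keyword literal used elsewhere in the grammar has to be repeated
// here, so that its automatically created token type is accepted as a name.
anySingleToken
    : INTEGER_LITERAL
    | WORD
    | 'DATA'   // same token type that the literal in the DATA statement creates
    // one further alternative per remaining keyword (about 750 in total)
    ;
```

With this change "DATA DATA." should parse, since the second DATA now
matches the 'DATA' alternative instead of having to be a WORD.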

1) I don't see a way to get the behavior I want without including a large
production that makes a union of all of these keywords in the language.
This would be okay, even if it is not as elegant as I would like, except
for the next point.

2) I am also worried that any grammar which uses about 800 token types will
always be a problem for ANTLR, because it generates a function which Java
rejects for exceeding some maximum size per function.

What suggestions do people have for solving this issue?

--Jesse Swidler
