[antlr-interest] Using String literals in grammar

Thu Feb 9 16:14:55 PST 2012

In case it is useful, I will attach a smallish version of the grammar, with
the proposed (less than ideal) solution, which exhibits the problem of a
function in the lexer being too large because of it.  I am not sure
anything can be done other than manually fixing the problem after ANTLR
builds the lexer.

--Jesse Swidler

On Thu, Feb 9, 2012 at 3:17 PM, Jesse Swidler <jswidler at gmail.com> wrote:

> I am trying to write a grammar for ABAP, which is a pretty verbose
> language.
>
> Pretty much nothing is reserved in ABAP.  You can name a variable whatever
> you want.  If your variable is named in such a way that it would make your
> syntax ambiguous, you must add an "!" before your variable name to resolve
> the problem (although the ABAP style guide recommends against naming your
> variables this way, it is not prevented.)  The DATA keyword is used to
> declare variables, so the following statement defines a variable named
> "DATA":
>
> DATA DATA.
>
>
> A greatly simplified grammar would look like:
>
> data:
>      'DATA' fieldDefId  DOT;
>
> fieldDefId: anySingleToken;
>
>
> WS : (' '|'\t')+ {$channel=HIDDEN;};
> NL : '\r'? '\n' {$channel=HIDDEN;};
> DOT: '.';
> INTEGER_LITERAL
> : '0'..'9'+;
> WORD: ~(' '|'\t'|'\r'|'\n'|'.'|':'|','|'('|')'|'<'|'>'|'*'|'-'|'\'')+ ;
>
> anySingleToken: INTEGER_LITERAL | WORD ; // Not really any token; for
> // instance, DOT is not supposed to be accepted.
>
>
> My problem here is that ANTLR goes ahead and makes a DATA token type
> automatically.  So if you were to try "DATA DATA." - which is most
> definitely legal - it does not work because DATA is not being returned as a
> WORD token like I want it to be.  I would need to add "DATA" as another
> alternative in the anySingleToken production.  There are about 750 words
> like DATA that would need to be accounted for and included in the
> anySingleToken production if I must create a unique token type for each
> keyword-like word in ABAP.  Additionally, when I defined this many
> different token types, ANTLR produced a Java file which would not compile
> on account of a function containing more than 25,000 lines of code.  So I
> am worried that I have two problems here.
>
> 1) I don't see a way to get the behavior I want without including a large
> production that makes a union of all of these keywords in the language.
> This would be okay, even if it is not as elegant as I would like, except
> that:
>
> 2) I am also worried that any grammar which uses about 800 token types will
> always be a problem for ANTLR, because it generates a function that Java
> rejects for exceeding a maximum size per function (presumably the JVM's
> 64 KB bytecode limit per method rather than a line count).
>
> What suggestions do people have for solving this issue?
>
>
>
>
> --Jesse Swidler
>
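
For reference, the workaround described in the quoted message (adding every
keyword literal as an extra alternative of anySingleToken) would look roughly
like the following for the single keyword DATA.  This is an untested ANTLR 3
sketch: the grammar name AbapSketch is made up for illustration, and the
lexer rules are copied from the message above.

```antlr
grammar AbapSketch;   // hypothetical name, for illustration only

// The 'DATA' literal makes ANTLR define an implicit token type for it.
data
    : 'DATA' fieldDefId DOT
    ;

fieldDefId
    : anySingleToken
    ;

// Every literal used in a parser rule has to be listed again here,
// otherwise "DATA DATA." fails because the second DATA is not a WORD.
anySingleToken
    : INTEGER_LITERAL
    | WORD
    | 'DATA'
    ;

WS  : (' '|'\t')+ {$channel=HIDDEN;} ;
NL  : '\r'? '\n' {$channel=HIDDEN;} ;
DOT : '.' ;
INTEGER_LITERAL
    : '0'..'9'+ ;
WORD: ~(' '|'\t'|'\r'|'\n'|'.'|':'|','|'('|')'|'<'|'>'|'*'|'-'|'\'')+ ;
```

With about 750 keywords, anySingleToken would need about 750 such
alternatives, and the generated lexer would carry one token type per keyword.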
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ABAP.g
Type: application/octet-stream
Size: 23149 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20120209/0d920733/attachment.obj