[antlr-interest] Solution for specialStateTransition exceeding 65k

Marcus Klimstra mgb.klimstra at gmail.com
Thu May 27 08:06:34 PDT 2010


Hi Jim,

Basically, the language has string literals that can contain
'placeholders', i.e. expressions surrounded by angle brackets:

stringLiteral
    :    SQUOTE! stringPart* SQUOTE!
    ;

stringPart
    :    STRCONT
    |    LT! expr XGT!
    ;

expr can also be a string, so 'foo <bar('baz')> quux' would be a valid
expression. The only exception is that '> is not allowed within
placeholders.

The lexer handles this with a stack of 'modes'. All operators and
keywords have a predicate requiring that the current mode be 'normal'
(i.e. outside a string or inside a placeholder). Inside a placeholder, the
'>' character yields an XGT token instead of the normal GT, to prevent
it from being gobbled up by a relational expression.

PLUS         :    {inNormal}?=>    '+'        ;
MINUS        :    {inNormal}?=>    '-'        ;
MUL          :    {inNormal}?=>    '*'        ;
DIV          :    {inNormal}?=>    '/'        ;
MOD          :    {inNormal}?=>    '%'        ;
//etc
NOT          :    {inNormal}?=>    'not'      ;
OR           :    {inNormal}?=>    'or'       ;
AND          :    {inNormal}?=>    'and'      ;
TRUE         :    {inNormal}?=>    'true'     ;
FALSE        :    {inNormal}?=>    'false'    ;
//etc

SQUOTE
    :    {inNormal}?=>        '\''    { pushMode(MODE_STRING); }
    |    {inString}?=>        '\''    { popMode(); }
    ;

XGT :    {inPlaceholder}?=>   '>'     { popMode(); }
    ;

GT  :    {inNormal}?=>        '>'
    ;

LT  :                         '<'     { if (inString) { pushMode(MODE_NORMAL); } }
    ;

STRCONT
    :    {inString}?=>        ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'_')+
    ;

As you can see, at the moment strings can only contain letters, digits,
spaces and underscores (/[a-z0-9 _]/i), since using (~('\''|'<'))+
results in an OutOfMemoryError...

inNormal, inString and inPlaceholder are booleans which are updated by
pushMode and popMode:

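private void updateMode() {
    // peekFirst() returns null when the stack is empty
    Integer mode    = stack.peekFirst();
    // 'normal' covers both the top level and the inside of a placeholder
    inNormal        = (stack.isEmpty() || mode == MODE_NORMAL);
    inString        = (mode == MODE_STRING);
    // a placeholder is a MODE_NORMAL entry pushed by LT inside a string
    inPlaceholder   = (mode == MODE_NORMAL);
}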
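
To make the bookkeeping concrete, here is a minimal stand-alone sketch of how pushMode and popMode could sit around updateMode. It is not the actual lexer code: the class name ModeStack, the constant values and the explicit null checks are assumptions made for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-alone version of the lexer's mode bookkeeping.
class ModeStack {
    static final int MODE_NORMAL = 0;  // assumed values, not from the real lexer
    static final int MODE_STRING = 1;

    private final Deque<Integer> stack = new ArrayDeque<Integer>();

    // An empty stack counts as 'normal', so these are the initial flags.
    boolean inNormal = true;
    boolean inString = false;
    boolean inPlaceholder = false;

    void pushMode(int mode) {
        stack.addFirst(mode);
        updateMode();
    }

    void popMode() {
        stack.removeFirst();  // throws NoSuchElementException on an empty stack
        updateMode();
    }

    private void updateMode() {
        Integer mode = stack.peekFirst();  // null when the stack is empty
        inNormal      = (mode == null || mode == MODE_NORMAL);
        inString      = (mode != null && mode == MODE_STRING);
        inPlaceholder = (mode != null && mode == MODE_NORMAL);
    }

    public static void main(String[] args) {
        // Mode changes while lexing 'foo <bar> quux'
        ModeStack m = new ModeStack();
        m.pushMode(MODE_STRING);  // opening quote: now in a string
        m.pushMode(MODE_NORMAL);  // '<' inside the string: now in a placeholder
        System.out.println(m.inNormal + " " + m.inString + " " + m.inPlaceholder);
        m.popMode();              // XGT: back in the string
        m.popMode();              // closing quote: back at the top level
        System.out.println(m.inNormal + " " + m.inString + " " + m.inPlaceholder);
    }
}
```

Note that inside a placeholder both inNormal and inPlaceholder are true, which is why the operator and keyword rules stay enabled there.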

Although my current approach seems to work pretty well, I am of course
open to suggestions. I can't really wait for ANTLR v4, however :)

Thanks,

- Marcus

On Thu, May 27, 2010 at 3:50 PM, Jim Idle <jimi at temporal-wave.com> wrote:
> There is quite often a way to rejig the lexer to avoid the huge expansion; if you post your grammar, maybe we can help. I think that such issues will go away in v4 :-)
>
> Jim
>
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of Marcus Klimstra
>> Sent: Thursday, May 27, 2010 2:19 AM
>> To: antlr-interest at antlr.org
>> Subject: [antlr-interest] Solution for specialStateTransition exceeding
>> 65k
>>
>> Hi,
>>
>> I ran into the problem of the huge specialStateTransition bytecode size
>> when using many gated semantic predicates in the lexer (in all my lexer
>> rules actually).  After a Google search I found that this is a known
>> issue to which there are some workarounds, but no real solutions. At
>> first I used the workaround to manually add local variables for the
>> outer-class references, but at some point even that no longer worked.
>> Therefore I changed the Java code generator to create separate methods
>> for each switch-case. This works quite well for me, so I wanted to
>> share it with the community. Note that I only tested this in the lexer,
>> since my parser has no specialStateTransition-method at the moment. I
>> also added annotations to suppress the useless warnings in the
>> generated code. A diff-file with these changes is attached.
>>
>> - Marcus
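
For anyone reading along without the diff: the idea of "separate methods for each switch-case" can be sketched roughly as below. This is a hand-written illustration, not actual ANTLR-generated code; the class name, method names, state numbers and return values are all made up. The point is only that a single Java method may not exceed the JVM's limit of 64 KB of bytecode, and moving each case body into its own method keeps the dispatching method small.

```java
// Hypothetical illustration only; none of these names come from ANTLR.
class SplitSwitchSketch {
    boolean inNormal = true;  // stand-in for the lexer's predicate state

    // Before: every case body lives in one method, so all of their bytecode
    // counts against that one method's size limit.
    int transitionInline(int s, int la) {
        switch (s) {
            case 0:
                if (la == '+' && inNormal) { return 10; }
                return -1;
            case 1:
                if (la == '>' && inNormal) { return 11; }
                return -1;
            default:
                return -1;
        }
    }

    // After: the switch only dispatches; each case body is its own method.
    int transitionSplit(int s, int la) {
        switch (s) {
            case 0:  return state0(la);
            case 1:  return state1(la);
            default: return -1;
        }
    }

    private int state0(int la) { return (la == '+' && inNormal) ? 10 : -1; }

    private int state1(int la) { return (la == '>' && inNormal) ? 11 : -1; }
}
```

Both variants return the same result for every input; only the distribution of bytecode over methods changes.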