[antlr-interest] problem about "the code for the static initializer is exceeding the 65535 bytes limit"

Francis ANDRE francis.andre.kampbell at orange.fr
Wed Aug 15 13:08:25 PDT 2012


Hi Jim

With all the respect I have for you, you cannot say that the only problem 
is a poorly designed grammar.

First of all, I would suggest looking at languages such as COBOL or 
Natural, or at esoteric third-generation languages, to put the "problem" 
in scope. Just as an example, Natural allows this kind of syntax:

99 / 99     which means: divide 99 by 99
99/99       which is a mask for date-number editing

The real solution for this kind of expression is to let the lexer do the 
job with contextual predicates, since the WHITE token is generally 
ignored. If, because of the 64K limitation, one has to use a parser rule 
instead of lexer rules, then the WHITE token becomes fully meaningful and 
must appear in ALL rules of the grammar... which is a really painful 
change, since ANTLR2 worked fine with contextual semantic predicates in 
lexer rules.
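To make the whitespace sensitivity concrete, here is a minimal 
hand-written Java sketch (not ANTLR-generated code; all names are 
illustrative) of the kind of contextual check such a lexer predicate 
performs: whitespace around the slash selects the division reading, while 
its absence selects the date-mask reading:

```java
// Sketch of a whitespace-sensitive disambiguation for Natural's
// "99 / 99" (division) vs "99/99" (date-edit mask). This is a
// hand-rolled illustration, not ANTLR output.
public class SlashContext {
    enum Kind { DIVISION, DATE_MASK }

    static Kind classify(String input) {
        int slash = input.indexOf('/');
        if (slash < 0) throw new IllegalArgumentException("no '/' in input");
        // Contextual predicate: inspect the characters around '/';
        // surrounding whitespace means arithmetic division.
        boolean spaceBefore = slash > 0
                && Character.isWhitespace(input.charAt(slash - 1));
        boolean spaceAfter = slash + 1 < input.length()
                && Character.isWhitespace(input.charAt(slash + 1));
        return (spaceBefore && spaceAfter) ? Kind.DIVISION : Kind.DATE_MASK;
    }

    public static void main(String[] args) {
        System.out.println(classify("99 / 99")); // DIVISION
        System.out.println(classify("99/99"));   // DATE_MASK
    }
}
```

This is exactly the kind of check that is trivial while WHITE is still 
visible to the lexer, and painful once whitespace is discarded before the 
parser runs.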

Secondly, ANTLR, as a generic and general-purpose compiler-compiler, 
should be able to produce a lexer and parser even for a poorly written 
grammar, as long as that grammar respects the specification of the 
metalanguage.

Third, the 64K problem is really a Java problem, linked to the inlining 
of the DFA classes into the lexer and parser. Since extracting the DFAs 
out of the generated lexer and parser solves this issue, I do not see 
why one should reject this option: it improves the capability of ANTLR 
without compromising its functional offering.
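For what it's worth, the extraction idea can be illustrated outside 
ANTLR with a small hand-rolled Java sketch (class names are 
hypothetical): each helper class gets its own static initializer, so the 
JVM's 65535-byte limit on a single method's bytecode applies per class 
rather than to one giant <clinit> holding every table:

```java
// Sketch: a large table split across helper classes so that no single
// static initializer exceeds the JVM's 65535-byte method-code limit.
// Class and field names are hypothetical, for illustration only.
public class SplitTables {
    // Each nested class carries its own <clinit>, so the bytecode cost
    // of building its array literal is paid in that class, not here.
    static final class Part1 {
        static final int[] DATA = {0, 1, 2, 3, 4}; // imagine thousands of entries
    }
    static final class Part2 {
        static final int[] DATA = {5, 6, 7, 8, 9};
    }

    // Reassemble the full table at runtime from the parts.
    static int[] concat() {
        int[] all = new int[Part1.DATA.length + Part2.DATA.length];
        System.arraycopy(Part1.DATA, 0, all, 0, Part1.DATA.length);
        System.arraycopy(Part2.DATA, 0, all, Part1.DATA.length, Part2.DATA.length);
        return all;
    }

    public static void main(String[] args) {
        System.out.println(concat().length);
    }
}
```

Moving the generated DFA tables into separate classes works on the same 
principle: the total size of the tables is unchanged, but no single 
static initializer has to carry all of them.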

Fourth, software should adapt as best it can to the human, and not the 
contrary. That is why compilers all have an optimization phase, so that 
people can write, for example, i = i + 1; instead of i++, which is the 
cleaner and more readable way to increment an integer. As much as 
possible, the same should hold for ANTLR: it should accept grammars that 
are not heavily left-factored, rather than forcing rewrites to work 
around a Java limitation.


Ter, what's your position on this?

Francis

On 15/08/2012 20:38, Jim Idle wrote:
> It does not need a fix. It is the grammar that needs to be improved. The
> huge DFAs are indicative of your grammars being overly complicated or poorly
> left factored. ANTLR might do better than it does in some cases, and v4 may
> well get around a lot of similar issues, but in general, improve your
> grammar files.
>
> First, look at the generated DFA. What rule, or combination of rules is
> generating this? Start there. Left factor. Simplify. Stop trying to do much
> of anything in the lexer other than match the simplest common token set.
> Stop trying to impose semantics in the parser ("you can only have at most
> two of 'these' here") - push such things into the tree walk, or add
> semantic checks (allow any number of 'these', count how many you got,
> then issue a semantic error).
>
> Writing good grammars is not easy. In some ways, because it is easy to just
> type stuff in and give it a whirl, ANTLR can cause you to shoot yourself in
> the foot!
>
> Step back and consider your grammar files. Do you really want a grammar that
> generates such huge decision tables? What is going wrong? It usually is not
> ANTLR itself.
>
>
> Jim
>
>
>> -----Original Message-----
>> From:antlr-interest-bounces at antlr.org  [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of Francis ANDRE
>> Sent: Wednesday, August 15, 2012 10:14 AM
>> To: Zhaohui Yang
>> Cc:antlr-interest at antlr.org
>> Subject: Re: [antlr-interest] problem about "the code for the static
>> initializer is exceeding the 65535 bytes limit"
>>
>> On 15/08/2012 16:17, Zhaohui Yang wrote:
>>> It's great someone is already trying a fix. I'd be glad to test your
>>> fix when it's out.
>>>
>>> Would you please introduce a bit what kind of fix that is? Is it for
>>> ANTLRWorks or the ANTLR tool? Is it a command-line option for
>>> separating the FOLLOW sets or suppressing them, or something else?
>> The 64K syndrome is a pure Java problem, due to the constraint that
>> the JVM does not support a static initializer greater than 64K -- shame
>> on it. Thus, if you look at the generated lexer and parser, you will
>> certainly see a lot of DFA classes, each of them holding some static
>> initializer values. The point is that the sum of the static
>> initializers of all those DFAs is greater than 64K, while the static
>> initializer of each individual DFA is fairly small, in most cases less
>> than 64K. Thus, one solution is to extract all those DFA classes and
>> put them outside the lexer or parser, in fixed directories following
>> this pattern:
>>
>> Let <grammar> be the directory of the grammar to generate; then all
>> the generated DFAs will go in:
>>
>> for the lexer's DFAs:    package <grammar>.lexer;
>> for the parser's DFAs:   package <grammar>.parser;
>>
>> and the references to all those DFAs will be:
>> in the lexer:            import <grammar>.lexer.*;
>> in the parser:           import <grammar>.parser.*;
>>
>> But hold on, the fix has to be approved by Ter, and I have not yet
>> submitted it. It needs to pass all the unit tests of ANTLR 3.4, and I
>> am working on that... There is a real challenge in getting the
>> lexer/parser to compile for Java code generated without a package, and
>> all those unit tests produce Java lexers/parsers in the top-level
>> directory.
>>> 2012/8/15 Francis ANDRE <francis.andre.kampbell at orange.fr
>>> <mailto:francis.andre.kampbell at orange.fr>>
>>>
>>>      Hi Zhaohui
>>>
>>>      I am currently working on fixing this issue with ANTLR 3.4...
>>>      Once I have a proper patch, would you be interested in testing
>>>      it?
>>>      FA
>>>      On 14/08/2012 18:05, Zhaohui Yang wrote:
>>>
>>>          Hi,
>>>
>>>          Here we have a big grammar, and the generated parser .java
>>>          fails with a compilation error: "the code for the static
>>>          initializer is exceeding the 65535 bytes limit".
>>>
>>>          I've searched the net for a while and found that this is a
>>>          widely known limit in the JVM or the javac compiler, and
>>>          there is not yet an option to raise it.
>>>
>>>          On the ANTLR side, I found 2 solutions proposed by others,
>>>          but neither of them is totally satisfying:
>>>
>>>          1. Separate the big grammar into 2 *.g files, and import one
>>>          from the other.
>>>              Yes, this removes the compilation error in the generated
>>>          Java. But ANTLRWorks does not support imported grammars
>>>          well. E.g., I cannot interpret a rule in the imported
>>>          grammar; it is simply not in the rule list for interpreting.
>>>          And gunit always fails with rules defined in the imported
>>>          grammar.
>>>
>>>          2. Modify the generated Java source, separating the
>>>          "FOLLOW_xxx_in_yyy" constants into several static classes
>>>          and changing references to them accordingly.
>>>              This is proposed here -
>>>          http://www.antlr.org/pipermail/antlr-interest/2009-November/036608.html
>>>          The author of that post actually has a fix in the ANTLR
>>>          source code (some string template), but I can't find the
>>>          attachment he referred to. And that was in 2009, so I
>>>          suspect the fix could be incompatible with the current ANTLR
>>>          version.
>>>              Without this fix we have to do the modification manually
>>>          or write a script for it, and the script is not that easy.
>>>
>>>          And we found a 3rd solution ourselves, which also involves
>>>          changing the generated Java:
>>>
>>>          3. Remove those FOLLOW_... constants completely, and replace
>>>          the references with "null".
>>>              Surprisingly, this works; there is just no error
>>>          recovery afterwards, which is not a problem for us. But we
>>>          really worry that this is unsafe, since it is not documented
>>>          anywhere.
>>>
>>>          After all, we're looking for any other solution that is
>>>          easier to apply, assuming we'll be constantly changing the
>>>          grammar and recompiling the parser.
>>>
>>>          Maybe there is a way to get ANTLRWorks and gunit to play
>>>          well with imported grammars?
>>>          Maybe there is already a command-line option for the ANTLR
>>>          tool that can generate the FOLLOW_... constants in separate
>>>          classes?
>>>          Maybe there is already a command-line option for the ANTLR
>>>          tool that can suppress FOLLOW_... constant code generation?
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Yang, Zhaohui
>>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address




More information about the antlr-interest mailing list