[antlr-interest] Big grammar => static initializer/method size is exceeding the 65535 bytes limit

Jim Idle jimi at temporal-wave.com
Thu Nov 5 07:20:10 PST 2009


> From: Loring Craymer [mailto:lgcraymer at yahoo.com]
> 
> Even if the explosion reflects an ill-formed grammar, do you really
> think that it should be detected by the Java compiler?  

Oh, please don't get me wrong. None of what I am saying here is in the vein of "stupid users should know better..."; far from it. I try to spend a lot of time helping people get started, as most of you know.

Of course, we might say "Why is Java bytecode so stupid as to limit a single method to 64K, and why does Java not have static initialization of arrays (well, I think it will in 7)?", and so on. Basically we all have to deal with what we have right now, and so a good discussion helps everyone.
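To make that concrete, here is a hypothetical sketch (the class and field names are invented; this is not ANTLR output). Because Java has no way to place array contents directly in static data, every element of an array literal compiles to a push-and-store sequence inside the synthetic <clinit> method, so a grammar whose tables run to tens of thousands of entries blows straight through the 65535-byte per-method limit:

    // Hypothetical sketch, not ANTLR-generated code.
    public class BigTable {
        // Each element below becomes bytecode in the class's static
        // initializer (<clinit>); with tens of thousands of entries the
        // initializer exceeds the 65535-byte per-method limit and the
        // class cannot be compiled or verified.
        static final short[] DFA_TRANSITIONS = {
            0, 1, 1, 2, 3, 3, 3, 4 /* ... imagine tens of thousands more ... */
        };
    }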

> The user
> experience is then that "ANTLR generates bad code" and "I have no clue
> as to how to fix the problem". 

Hence the list of course - I don't wish to give any impression that I am telling people it's entirely their fault and that they should go away ;-) Besides, it isn't really my place to make such statements.

> Showstoppers like this slow the
> development process; it is far better if ANTLR "always" generates
> compilable code 

Without a doubt, to the extent that this is feasible.

>and ideally produces useful warnings that support
> successive refinement of a grammar.

See my current project rewriting the ANTLR grammars [below].

>  Besides that, this behavior is
> limited to the Java target; shouldn't all targets behave similarly?

Well no, because ANTLR isn't really generating anything incorrectly; it is just that Java has a limitation that forces hoops to be jumped through. Whether ANTLR can do better in any particular case is of course a different issue. We just heard Ter say that he will be working on code generation when he is back from Tokyo, and we have been talking about code optimization for a while.

At the moment, though, Ter is redoing StringTemplate so that it is self-contained and has no reliance on ANTLR 2, and I am rewriting all the grammars in v3 so that the tool is self-hosted. One reason for the latter is to solve other user-related issues (you mention some of them above), such as error reporting being less than optimal right now. Good reporting will help the user experience a lot. This stuff will probably be a 3.4 release. The new structure will allow me to look for patterns in the way things are put together and make some suggestions about improvements. A lint-like mode for the tool might be a cool thing, for instance.

> 
> Of course, you could be right that the explosion is a feature.  

Well, not a feature of ANTLR of course, other than in the sense that all bugs are unintended features :-) I meant that it is a reflection of the fact that the grammar is structured in a particular way, and of course even then it might be that ANTLR isn't doing the best job of dealing with such a construct.

I think that the LL(*) algorithm is just one thing that shows ANTLR aims at practicality and the Nike development system: "Just Do It!". So of course, when something like this comes up (and it comes up a lot), we should really do something about it. Resources are always in short supply, of course.

> If so,
> why haven't you attempted to replicate this behavior for generated C
> code?  :)

It's worth mentioning the C target in this respect, because I sometimes get questions about why the C object code seems so much bigger than the Java code. In C, the tables are generated such that the compiler can lay them all out in data space, and there is no runtime initialization code. Hence the DFAs are not run-length encoded but laid out just as they are. So Java looks like it is smaller, but in the end it is not as efficient and allocates all that memory at startup time. C does not have this 64K code-size limit (and really Java never should have had one either), so you never get this issue there.
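On the Java side, the run-length trick looks roughly like the sketch below; the packing format (count/value pairs stored as chars) and the names are invented for illustration, and the real ANTLR runtime uses its own encoding and helper methods. The point is that the packed String lives in the constant pool rather than in initializer bytecode, at the price of decoding work and heap allocation every time the class is loaded:

    // Simplified sketch of the run-length-encoded table idea; the
    // packing format and names are made up for illustration.
    public class PackedTable {
        // Lives in the constant pool, so it does not count against the
        // 64K bytecode limit of the static initializer.
        private static final String PACKED = "\3\1\2\2\1\3"; // (3 x 1)(2 x 2)(1 x 3)

        // Decoded at class-load time - this is the startup cost and heap
        // allocation that the C target avoids by letting the compiler lay
        // the tables out directly in data space.
        static final short[] TRANSITIONS = unpack(PACKED);

        static short[] unpack(String packed) {
            int size = 0;
            for (int i = 0; i < packed.length(); i += 2) {
                size += packed.charAt(i);                 // run length
            }
            short[] out = new short[size];
            int n = 0;
            for (int i = 0; i < packed.length(); i += 2) {
                int count = packed.charAt(i);
                short value = (short) packed.charAt(i + 1);
                for (int j = 0; j < count; j++) {
                    out[n++] = value;                     // expand the run
                }
            }
            return out;                                   // [1, 1, 1, 2, 2, 3]
        }
    }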

However, table-driven DFAs are pretty inefficient, especially for C and JIT-produced machine code, as they tend to defeat the CPU data cache almost immediately and almost always. So the C target now avoids DFA tables as much as possible and produces switch statements instead. These give smaller code and don't miss in the cache, so the C target skips the hassle altogether. We should also be looking at this for Java and C#, but the difference for Java is of course this 64K limit on method code, so it starts to get all target-specific on us. A solution needs to be target-agnostic if at all possible, and many code generation issues that affect one target can be solved in a way that also improves other targets, or at least does not hurt them.
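As a rough illustration of the difference (a made-up two-state recognizer, not anything a target actually emits): the table-driven version loads from an array on every step, while the switch version turns the same decisions into compare-and-branch code, with no table to drag through the cache and no big initializer to run into the method-size limit:

    // Made-up two-state example, shown both ways; not generated code.
    public class DfaStyles {
        // Table-driven: state x input class -> next state. Every step is a
        // load from a (potentially huge) array, which is what tends to miss
        // the data cache for real grammars.
        // Input classes: 0 = 'a', 1 = 'b', 2 = anything else.
        static final int[][] TABLE = {
            {  1, -1, -1 },   // state 0
            { -1,  2, -1 },   // state 1
        };

        static int nextStateFromTable(int state, char c) {
            int cls = (c == 'a') ? 0 : (c == 'b') ? 1 : 2;
            return TABLE[state][cls];
        }

        // Switch-driven: the same decisions become compare-and-branch code,
        // so there is no table to fetch and no big static initializer.
        static int nextStateFromSwitch(int state, char c) {
            switch (state) {
                case 0: return c == 'a' ? 1 : -1;
                case 1: return c == 'b' ? 2 : -1;
                default: return -1;
            }
        }
    }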

In short then, nobody is ignoring this, and the solutions already submitted may well be the ones adopted, or maybe there are others.

But, if you all switch to C.... ;-)

Jim
I see there are tons of emails on all this and I will try to answer them all. 




