[antlr-interest] ANTLR Parser file different on different machines - method to exceed 65535 characters

Tue Aug 23 08:25:15 PDT 2011

I think that your suspicions are correct. Complex predicates lead to 
complex lookahead, and complex keyword matches in the lexer lead to lots 
of function calls. Simplifying these things wherever you can will help. 
Because you only see this problem when resources are limited, at least 
it means that ANTLR is not getting stuck in an infinite loop. Your 
grammar is just very complex and requires lots of time and memory to 
process.

Someone else may be able to give you better advice on how to improve 
your predicates. The short answer is to make them as simple as they can 
be (as few lookahead tokens as are necessary to make the decision).

I can tell you from experience that if the language keywords are 
case-insensitive, there is a far better way than what you are doing. The 
trick is to define the tokens in all uppercase:

PROCEDURE : 'PROCEDURE';

Then override the LA() function of the lexer's input stream so that it 
returns each lookahead character in uppercase as well. I've only done 
this with the C target, so I won't be much help with specifics for the 
Java target. I'd suggest searching antlr.markmail.org for more info on 
this: 
http://markmail.org/message/yxjncd3vvew2vplu
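
To illustrate the idea, here is a minimal, self-contained Java sketch. It is 
not the ANTLR runtime: UpperCaseLookahead and matchKeyword are hypothetical 
names chosen so the example runs without any jars. With the ANTLR 3 Java 
runtime, the equivalent would presumably be a subclass of ANTLRStringStream 
that overrides LA() in the same way. The sketch only shows that the lookahead 
is folded to uppercase, so a rule like PROCEDURE : 'PROCEDURE'; can match any 
casing while the matched text keeps its original case.

```java
/**
 * Hypothetical stand-in for a lexer input stream (not an ANTLR class).
 * Lookahead is folded to uppercase; the stored text is left untouched.
 */
final class UpperCaseLookahead {
    static final int EOF = -1;

    private final char[] data; // original input, case preserved
    private int p = 0;         // index of the next character to consume

    UpperCaseLookahead(String input) {
        this.data = input.toCharArray();
    }

    /** Returns the i-th character ahead (1 = next), uppercased, or EOF. */
    int LA(int i) {
        if (i < 1) {
            throw new IllegalArgumentException("this sketch only looks forward");
        }
        int index = p + i - 1;
        if (index >= data.length) {
            return EOF;
        }
        return Character.toUpperCase(data[index]);
    }

    /**
     * Tries to match an all-uppercase keyword at the current position.
     * On success, consumes it and returns the text as originally written;
     * otherwise returns null and consumes nothing.
     */
    String matchKeyword(String upperKeyword) {
        for (int i = 0; i < upperKeyword.length(); i++) {
            if (LA(i + 1) != upperKeyword.charAt(i)) {
                return null;
            }
        }
        String text = new String(data, p, upperKeyword.length());
        p += upperKeyword.length();
        return text;
    }
}
```

For example, new UpperCaseLookahead("Procedure x").matchKeyword("PROCEDURE") 
returns "Procedure", with the original casing intact.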

- Justin

On 8/23/2011 10:59 AM, Luke Tucker wrote:
> Thanks for your response. The good news is that, based on your comments, I have now managed to reproduce the problem at least. I ran the generation on a virtual machine with reduced memory and resources and got the same problem.
>
>
> I will try the -Xwatchconversion option to identify places where I need to improve the grammar. Am I correct in assuming that this might be related to using predicates with high-level rules, i.e. along the lines of "(expression)=>"?
>
>
> Another possibility is that the STL language I am parsing has case-insensitive keywords, so instead of setting up tokens the normal way, I have had to set up tokens for keywords in the following format:
>
> PROCEDURE    :    ('P'|'p')('R'|'r')('O'|'o')('C'|'c')('E'|'e')('D'|'d')('U'|'u')('R'|'r')('E'|'e');
>
>
> Could this be adding a lot of extra complexity? If so I would love to find a better way of doing this.
>
> The reason I am using ANTLR 3.2 is that this was the latest version available when development on the project began and now this version has become part of ESA's platform definition for Linux servers. If I can get this changed to ANTLR 3.4, will this work better with reduced memory or will the same thing occur?
>
> Luke
>
>
>
> ________________________________
> From: Justin Murray <jmurray at aerotech.com>
> To: antlr-interest at antlr.org
> Sent: Tuesday, 23 August 2011, 15:57
> Subject: Re: [antlr-interest] ANTLR Parser file different on different machines - method to exceed 65535 characters
>
> Hi Luke,
>
> I may not be the best person to answer, but I do have some suggestions.
> My guess as to why you are seeing different behavior on different
> machines is that the machines probably have different amounts of memory
> available for ANTLR to use. You are requesting a 1GB heap for the JVM
> (-Xms and -Xmx set its initial and maximum size), but if you are running
> on a machine with limited RAM available, there will be issues.
>
> The reason you are using so much ram and time to generate the parser is
> probably because of some ambiguities in your grammar. Is there a reason
> why you are using ANTLR 3.2 instead of 3.3 or 3.4? There were changes
> made in 3.3 to eliminate the need for ever using -Xconversiontimeout.
> See the Important things to note section on this page:
> http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3.3+Release+Notes. You
> should try using -Xwatchconversion to figure out where exactly your
> grammar is getting stuck, and try to find and fix the ambiguity.
>
> Hope that helps,
>
> - Justin
>
> On 8/23/2011 6:27 AM, Luke Tucker wrote:
>> Hi, first of all, a little bit of background that you might find interesting in terms of what ANTLR is being used for...
>>
>> I have been using ANTLR to convert a language called STL into JavaScript. This STL language is used to define control procedures for commanding European Space Agency ground station monitoring and control equipment. This is a custom language only used by ESA and is quite quirky in terms of grammar. The existing system is around 15 years old and is being replaced with a new system. These STL procedures are being converted into equivalent JavaScript, which will run in a Rhino engine. My grammar for doing this conversion is more or less complete and the new scripts are successfully commanding ground station equipment running in a simulator. I should also mention that the target language for the grammar is Java and that I am using antlr-3.2.
>>
>>
>> Now for the problem and the reason for my message...
>>
>> As part of the build process for the application doing the conversion, ANTLR is run to generate the lexer and parser files from ant before building the Java application. Previously I had been generating the Parser and Lexer in our environment and committing the resulting java files to the repository rather than performing the generation as part of the build. This works perfectly well here on a number of different development machines and even on a virtual machine with a small amount of memory.
>>
>> Typically, however, when we deliver the software to ESA and they try to run the build process on their machines, the Parser file produced is different and the build fails. The reason for the build failure is that the "static final String DFA53_specialS" and "static final String DFA53_transitionS" arrays are being produced with a huge number of elements, together with a massive switch statement in a "specialStateTransition()" method that is causing the method to exceed 65535 characters.
>> I have read that this can occur with complex and/or unoptimised grammar and I will be the first to admit that the grammar I have written might not be 100% optimised. Since the parser generation works on our machines and not on ESA's, I don't think my limited ANTLR-foo is the root cause of this problem. I have also confirmed that the version of ANTLR being used is exactly the same.
>>
>> I have done a lot of searching through the mailing list archives and found a suggestion that using -Xconversiontimeout 100000 as an input to ANTLR might help solve this issue, but that doesn't seem to be helping. Just in case I did this wrong, this is how I used this option to generate the file:
>>
>> java -Xms1024M -Xmx1024M -jar antlr-3.2.jar -Xconversiontimeout 100000
>>
>> Is there anything else that might cause a large switch block in specialStateTransition() on one machine and not another with the same grammar/same ANTLR version? Should I get them to try with even higher values for Xconversiontimeout or am I barking up the wrong tree with this?
>>
>> If necessary I can post the grammar I am using (minus the inline code) and an example of the STL language being parsed, but I don't think that's necessary at this stage. In any case, it will probably hurt your eyes.
>>
>> Thanks very much in advance for any help.
>>
>> Luke
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address