[antlr-interest] ANTLR running out of memory during generation

Ron Hunter-Duvar ron.hunter-duvar at oracle.com
Sat Jan 30 21:18:27 PST 2010


Thanks for the response. Yeah, the target language is kind of obvious 
isn't it? What else could have that many keywords?

I might try turning off backtracking later on and see what all I have to 
fix. Right now backtracking is turning out to be a lot easier, and hasn't 
created any performance problems. Also, I'm not concerned with rejecting 
invalid code, only with successfully parsing all valid code, which 
simplifies things considerably.
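
For reference, the non-reserved-keyword approach described in my original 
message below looks roughly like this (this is only a sketch in ANTLR 3 
syntax; KW_SELECT and KW_FROM are placeholders for the real keyword list):

```antlr
// Sketch: every keyword gets its own token type in the lexer, and the
// parser's id rule accepts any of them wherever an identifier may appear.
grammar Keywords;

id : ID | QUOTED_ID | keyword ;

// Non-reserved keywords: usable both as keywords and as identifiers.
keyword : KW_SELECT | KW_FROM ;

KW_SELECT : 'SELECT' ;
KW_FROM   : 'FROM' ;
ID        : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;
QUOTED_ID : '"' (~'"')* '"' ;
WS        : (' '|'\t'|'\r'|'\n')+ { $channel=HIDDEN; } ;
```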

But the problem I'm having doesn't relate to any specific keyword. I 
even tried inserting garbage keywords, with the same result. To me, the 
fact that it runs perfectly fine (and fast) with 631 keywords, and 
apparently hits some endless loop or recursion at 632 that makes it run 
10x longer and run out of memory, indicates a bug or implementation 
limitation. The fact that 3.1 and 3.2 behave exactly the same way 
indicates it's in code that hasn't changed in the latest release. 
Unfortunately, I don't know enough of ANTLR's internals to be able to 
track it down, and don't have the time now to learn what I need to.

I have run it with 2GB of heap space. I bumped it up from 512MB to 1GB 
and then 2GB, and all that accomplished was making it run a few seconds 
longer before running out of memory: a clear symptom of an endless loop 
or recursion. There shouldn't be anything I can do in my grammar that 
would cause ANTLR to act this way.

I'll try those switches and see if they help. For the moment I've been 
able to sidestep the problem by cutting the list down to the set of 
keywords for the currently implemented parts of the language, bringing 
it down to about 150 (I had started with the full keyword list that's 
available, and then kept adding all the omissions from that list, of 
which there are many). But ultimately I'll have to find a way to deal 
with it. I'm hoping maybe Terry will have a bug fix for me before then 8^).
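
For anyone following along, the combined invocation Jim suggests below 
would look something like this (the grammar file name and jar version are 
hypothetical; adjust to your setup):

```shell
# Sketch: ANTLR 3 code generation with a larger conversion timeout,
# switch-based DFA output, and a 2GB JVM heap.
java -Xmx2G -cp antlr-3.2.jar org.antlr.Tool \
    -Xconversiontimeout 10000 \
    -Xmaxswitchcaselabels 32000 \
    -Xminswitchalts 1 \
    -Xmaxinlinedfastates 65534 \
    MyLang.g
```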


Jim Idle wrote:
> Ron,
> First, you really need to switch off backtracking unless the objective of your parser is to analyze SQL (you gave it away when you mentioned 632 keywords that can be identifiers). There are not as many predicates required as you might think, so long as you left-factor everything.
> Your tokens should be consecutive so long as you list them that way in the lexer. 
> The problem might well be that although SQL sort of allows all keywords to be identifiers, it does not allow all of them, because some would be too ambiguous even for a syntax-directed, hand-crafted parser. If you turn on backtracking and then try to allow one of these reserved words to be an identifier, you will probably mask the issue, because all warnings and errors are turned off. 
> It is entirely feasible to create a full SQL parser without backtracking, with very little lookahead and few predicates (all of the one- or two-token lookahead type). I have an online demo of T-SQL, for instance, on my web site at www.temporal-wave.com (select the 'online demos' link), and Oracle SQL/PLSQL will be up there before long too.
> So, I think you will need to do the following to have a chance of generating the code:
> 1) Use -Xconversiontimeout 10000
> 2) Cause switches to be generated rather than ifs: -Xmaxswitchcaselabels 32000 -Xminswitchalts 1 -Xmaxinlinedfastates 65534
> 3) Use -Xmx2G when invoking the java command (assuming your jvm allows that)
> But if you cannot get it going that way, then basically you are masking a bigger problem in your grammar that you are not seeing because of global backtracking. 
> Jim
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of Ron Hunter-Duvar
>> Sent: Friday, January 29, 2010 8:52 PM
>> To: antlr-interest at antlr.org
>> Subject: [antlr-interest] ANTLR running out of memory during generation
>> I'm having a strange problem with ANTLR. I'm building a grammar for a
>> language with a huge number (hundreds) of non-reserved keywords. I'm
>> using the approach of having the lexer return a different token type
>> for
>> each keyword, and then having a parser rule of the form:
>>     id : ( ID | QUOTED_ID | KW_A | KW_B | ... | KW_ZZZ );
>> This was working great until today. In fact, ANTLR 3.2 generates
>> surprisingly clever code for this - all the keywords are assigned
>> consecutive token numbers, and generated code just says:
>>     if ( (input.LA(1)>=KW_A && input.LA(1)<=KW_ZZZ)
>>          || (input.LA(1)>=ID && input.LA(1)<=QUOTED_ID) ) {
>>         input.consume();
>>         ...
>> This works all the way up to 631 keywords. ANTLR runs in about 20
>> seconds, and never uses more than 269MB of memory. When I add a 632nd
>> keyword (doesn't matter what the keyword is), and change nothing else,
>> ANTLR runs for 2 minutes and runs out of heap space. I kept bumping the
>> max space up, but even going to 2GB doesn't make any difference.
>> What's really interesting is that I was using ANTLR 3.1 until now. When
>> I ran into this I upgraded to 3.2, but both of them fail at exactly the
>> same spot, 632 keywords. Not surprisingly, the stack trace varies from
>> one run to the next, depending on the exact point it runs out of
>> memory,
>> but it always has deeply nested calls to these and other methods:
>>     org.antlr.stringtemplate.language.ASTExpr.writeTemplate(ASTExpr.java:750)
>>     org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:680)
>>     org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:660)
>>     org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
>>     org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:149)
>>     org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:705)
>> I don't know if it makes a difference, but I'm using backtracking
>> (otherwise, this approach to non-reserved keywords doesn't work without
>> a lot of synpreds), and outputting ASTs.
>> Since this is size related, it's hard to narrow it down to a simple
>> example. I could try to duplicate it with just the id rule and nothing
>> else.
>> Any ideas what might be happening here, and whether a fix might be
>> possible?
>> Thanks,
>> Ron
>> --
>> Ron Hunter-Duvar | Software Developer V | 403-272-6580
>> Oracle Service Engineering
>> Gulf Canada Square 401 - 9th Avenue S.W., Calgary, AB, Canada T2P 3C5
>> All opinions expressed here are mine, and do not necessarily represent
>> those of my employer.
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
>> email-address
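
As an aside, the consecutive-token-numbering trick in the generated code 
quoted above reduces a 600-odd-alternative membership test to two range 
comparisons. A minimal sketch in Java (the token type values here are 
made up for illustration; ANTLR assigns its own numbering):

```java
public class TokenRangeCheck {
    // Hypothetical token type constants. Because ANTLR assigns the
    // keywords consecutive token numbers, the whole block can be
    // tested with a single range comparison.
    static final int KW_A = 10;      // first keyword token type
    static final int KW_ZZZ = 641;   // last keyword token type
    static final int ID = 642;
    static final int QUOTED_ID = 643;

    // Equivalent of the generated test: two range checks instead of
    // a switch over hundreds of case labels.
    static boolean isIdToken(int la) {
        return (la >= KW_A && la <= KW_ZZZ)
            || (la >= ID && la <= QUOTED_ID);
    }

    public static void main(String[] args) {
        System.out.println(isIdToken(KW_A));       // keyword: true
        System.out.println(isIdToken(QUOTED_ID));  // quoted id: true
        System.out.println(isIdToken(5));          // below the ranges: false
    }
}
```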

