[antlr-interest] OutOfMemory parsing large input

Mon Apr 28 08:47:46 PDT 2008

Neil,

I suspect that the issue isn't so much that, as with 12MB you should not really have any problems, but that your LINE rule can match nothing at all (the empty string). Before looking any further, try changing your LINE rule to:

LINE: ('a'..'z'|' ')+ ;

And see if that helps at all.

ANTLR does tokenize everything up front, but with 12MB of input that alone should not be causing you out-of-memory errors.
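
To illustrate why a LINE rule that can match zero characters is worth ruling out first, here is a minimal sketch in plain Java. It is a toy model and not ANTLR's generated code: the class EmptyMatchDemo, its methods, and the maxTokens safety cap are all hypothetical names made up for this example, and the real lexer's behaviour depends on the prediction code ANTLR generates. The sketch assumes an eager token buffer (as the fillBuffer frame in the stack trace suggests) and an input containing one character that no rule accepts.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: NOT ANTLR's implementation, just the suspected failure mode. */
public class EmptyMatchDemo {

    /** Length of the run of 'a'..'z' or ' ' starting at pos (may be 0). */
    static int matchLine(String input, int pos) {
        int end = pos;
        while (end < input.length()) {
            char c = input.charAt(end);
            if ((c >= 'a' && c <= 'z') || c == ' ') end++; else break;
        }
        return end - pos;
    }

    /** Length of '\r'? '\n' at pos, or 0 if it does not match there. */
    static int matchNewline(String input, int pos) {
        int p = pos;
        if (p < input.length() && input.charAt(p) == '\r') p++;
        if (p < input.length() && input.charAt(p) == '\n') return p + 1 - pos;
        return 0;
    }

    /**
     * Buffers every token before returning, like an eager token stream.
     * allowEmptyLine = true models LINE with '*', false models LINE with '+'.
     * maxTokens is a hypothetical safety cap so the demo stops instead of
     * exhausting the heap.
     */
    static List<String> fillBuffer(String input, boolean allowEmptyLine, int maxTokens) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length() && tokens.size() < maxTokens) {
            int len = matchNewline(input, pos);
            if (len > 0) {
                tokens.add("NEWLINE");
                pos += len;
                continue;
            }
            len = matchLine(input, pos);
            if (len > 0 || allowEmptyLine) {
                // With '*', a zero-length LINE is emitted and pos never moves.
                tokens.add("LINE");
                pos += len;
                continue;
            }
            pos++; // no rule matches: skip one character as crude error recovery
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "abc\n1\n"; // '1' is not accepted by LINE or NEWLINE
        System.out.println(fillBuffer(input, true, 1000).size());  // hits the cap
        System.out.println(fillBuffer(input, false, 1000).size()); // finishes normally
    }
}
```

In this model the '*' version gets stuck at the '1' and keeps adding empty LINE tokens until the cap is reached, while the '+' version skips the bad character and finishes. If something similar happens in the real lexer, the token buffer grows without bound regardless of how small the input is.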

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Neil Bacon
> Sent: Sunday, April 27, 2008 9:25 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] OutOfMemory parsing large input
> 
> Hi,
> I've been attempting to parse a large repetitive data file with Antlr,
> but have found a few areas where it seems to buffer up data way ahead
> of what is necessary to do the job. I am about to break up the input
> outside of antlr to work around this unless anyone can offer a better
> idea.
> Antlr v3 is great, but this is an issue I didn't have with Antlr v2.
> 
> Issues:
> 1. First the ANTLR*Stream classes load all the input at once.
> 2. Then, when the first parser rule needs a lexer token, the lexer
>    appears to be tokenizing all the input (certainly much more than is
>    required to satisfy the current parser rule) and running out of memory.
> 
> I guess I could make my own Stream implementation to address 1. Is
> there any way to address 2?
> 
> Please see below for a very simple grammar to demonstrate the issue
> and a stack trace.
> The test data is 12Mb of lower case text.
> 
> Regards,
>     Neil.
> 
> Simple test grammar:
> 
> list : head body*;
> head : LINE NEWLINE;
> body : LINE NEWLINE;
> LINE : ('a'..'z'|' ')*;
> NEWLINE : '\r'? '\n';
> 
> java.lang.OutOfMemoryError: Java heap space
>     at org.antlr.runtime.Lexer.emit(Lexer.java:161)
>     at org.antlr.runtime.Lexer.nextToken(Lexer.java:111)
>     at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
>     at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
>     at org.antlr.runtime.CommonTokenStream.LA(CommonTokenStream.java:300)
>     at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:89)
>     at org.cambia.sequence.st25.antlr.AParser.head(AParser.java:90)
>     at org.cambia.sequence.st25.antlr.AParser.list(AParser.java:37)