[antlr-interest] OutOfMemory parsing large input

Sun Apr 27 23:37:56 PDT 2008

Neil,

CommonTokenStream tokenizes the complete input and caches every token in a list as soon as you ask it for the first token (that is the fillBuffer call in your stack trace). So in addition to the ANTLR*Stream input class, you will have to replace CommonTokenStream as well.
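
To illustrate the idea only, here is a minimal sketch of an on-demand token buffer. It is not ANTLR runtime code: the class LazyTokenWindow and its buffered() method are hypothetical names made up for this example, and the Supplier stands in for a call such as Lexer.nextToken(). A real replacement would also have to implement the runtime's TokenStream interface, including mark/rewind if the grammar backtracks, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: pulls tokens from the source only when the parser
 * looks ahead, and drops each token as soon as it is consumed.
 */
final class LazyTokenWindow<T> {
    private final Supplier<T> source;              // stands in for the lexer
    private final T eof;                           // sentinel the source returns at end of input
    private final List<T> window = new ArrayList<>();
    private boolean exhausted = false;             // true once the sentinel has been buffered

    LazyTokenWindow(Supplier<T> source, T eof) {
        this.source = source;
        this.eof = eof;
    }

    /** Returns the k-th token of lookahead (k >= 1), or the EOF sentinel past the end. */
    T LT(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        // Fill only as far as this lookahead request needs.
        while (window.size() < k && !exhausted) {
            T t = source.get();
            window.add(t);
            if (t.equals(eof)) {
                exhausted = true;
            }
        }
        return k <= window.size() ? window.get(k - 1) : eof;
    }

    /** Discards the current token so it can be garbage collected; stays put at EOF. */
    void consume() {
        T current = LT(1);
        if (!current.equals(eof)) {
            window.remove(0);
        }
    }

    /** Number of tokens currently held in memory. */
    int buffered() {
        return window.size();
    }
}
```

With a buffer like this, the number of tokens held at any time is bounded by the deepest lookahead the parser asks for, rather than by the size of the input.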

Regards, Patrick.

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Neil Bacon
Sent: Monday, 28 April 2008 6:25
To: antlr-interest at antlr.org
Subject: [antlr-interest] OutOfMemory parsing large input

Hi,
I've been trying to parse a large, repetitive data file with ANTLR,
but have found a few places where it seems to buffer data far ahead
of what is needed to do the job. Unless anyone can offer a better
idea, I am about to break up the input outside of ANTLR to work
around this.
ANTLR v3 is great, but this is an issue I didn't have with ANTLR v2.

Issues:
1. First, the ANTLR*Stream classes load all the input at once.
2. Then, when the first parser rule needs a lexer token, the lexer
   appears to tokenize all the input (certainly much more than is
   required to satisfy the current parser rule) and runs out of memory.

I guess I could make my own Stream implementation to address 1. Is
there any way to address 2?

Please see below for a very simple grammar to demonstrate the issue
and a stack trace.
The test data is 12 MB of lowercase text.

Regards,
    Neil.

Simple test grammar:

list : head body*;
head : LINE NEWLINE;
body : LINE NEWLINE;
LINE : ('a'..'z'|' ')*;
NEWLINE : '\r'? '\n';

java.lang.OutOfMemoryError: Java heap space
    at org.antlr.runtime.Lexer.emit(Lexer.java:161)
    at org.antlr.runtime.Lexer.nextToken(Lexer.java:111)
    at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
    at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
    at org.antlr.runtime.CommonTokenStream.LA(CommonTokenStream.java:300)
    at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:89)
    at org.cambia.sequence.st25.antlr.AParser.head(AParser.java:90)
    at org.cambia.sequence.st25.antlr.AParser.list(AParser.java:37)