[antlr-interest] ANTLR running out of memory while parsing huge files

Tue Apr 21 07:25:48 PDT 2009

Nick Vlassopoulos wrote:
> Hi,
>
> I am fairly new to ANTLR and I have come across a problem.
> I have written a simple grammar to parse huge data files (several 
> gigabytes each)
> and antlr seems to crash by running out of memory (I am using "C" as 
> the target language).
>
> The data files have the general format:
> HEADER
>  DECL
> BODY
>  <several millions of lines here>
> END
>
> What seems to be the problem is that antlr tries to parse the whole 
> data file
> at once. Is there a way to "force" parsing line by line? (at least for 
> the "BODY" part?)
>
You will need to split the input into more manageable chunks yourself, I 
am afraid. When you start the parser it asks the lexer for the first 
token, which causes the lexer to tokenize the entire input.

You can feed the input line by line by resetting the lexer and parser 
and providing a new string stream, with the pointer and length set 
accordingly, and hence a new token stream for the chunk you wish to 
parse next. There is a relatively small overhead in doing this from C, 
and it is the same technique you would use to parse any chunk. If your 
input is several gigabytes, then the standard technique of reading the 
whole file and parsing it all at once would not be much use anyway.

In your position I would write a custom input stream that performs 
buffered reads on the file and returns EOF at strategic points, but 
which can be reset (or maybe auto-reset) until the real EOF is found. 
Your parser can retain state so you know where you are. At each EOF, 
you can ask the input stream whether it was really the end or just a 
fake end, in which case you restart from there.

Make sure that you retain the input stream for as long as you need to 
materialize the text of the tokens, as the tokens just point into the 
input stream. Alternatively, you can set the token text explicitly, or 
build up your output on the fly, and so on.
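
A minimal sketch of the chunking idea, in plain C with no ANTLR runtime 
calls. ChunkReader, chunk_reader_next and count_chunks are hypothetical 
names invented for this illustration, and the chunk size in lines is an 
assumed tuning knob. Each call loads the next few lines into a reusable 
buffer (the "fake EOF"), and real_eof records when the file itself has 
run out. The point where the lexer and parser would be reset onto the 
new buffer is only marked with a comment.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the ANTLR C runtime: hands out the
 * input a bounded number of lines at a time. */
typedef struct {
    FILE  *fp;         /* underlying data file */
    size_t max_lines;  /* lines per chunk; a "fake EOF" follows each chunk */
    char  *buf;        /* NUL-terminated text of the current chunk */
    size_t cap;        /* allocated size of buf */
    size_t len;        /* bytes used in buf, excluding the NUL */
    int    real_eof;   /* set once a read has hit the end of the file */
} ChunkReader;

/* Returns 0 on success, -1 if the buffer cannot be allocated. */
int chunk_reader_init(ChunkReader *r, FILE *fp, size_t max_lines)
{
    assert(r != NULL && fp != NULL && max_lines > 0);
    r->fp = fp;
    r->max_lines = max_lines;
    r->cap = 4096;
    r->len = 0;
    r->real_eof = 0;
    r->buf = malloc(r->cap);
    return r->buf != NULL ? 0 : -1;
}

/* Loads the next chunk into r->buf, overwriting the previous one.
 * Returns 1 if a chunk is available, 0 at the real end of input,
 * -1 on allocation failure. */
int chunk_reader_next(ChunkReader *r)
{
    size_t lines = 0;
    int c;

    r->len = 0;
    while (lines < r->max_lines && (c = fgetc(r->fp)) != EOF) {
        if (r->len + 1 >= r->cap) {          /* keep room for the NUL */
            char *grown = realloc(r->buf, r->cap * 2);
            if (grown == NULL)
                return -1;
            r->buf = grown;
            r->cap *= 2;
        }
        r->buf[r->len++] = (char)c;
        if (c == '\n')
            lines++;
    }
    r->buf[r->len] = '\0';
    if (lines < r->max_lines)                /* loop stopped on EOF */
        r->real_eof = 1;
    return r->len > 0 ? 1 : 0;
}

void chunk_reader_free(ChunkReader *r)
{
    free(r->buf);
    r->buf = NULL;
}

/* Driver loop: returns the number of chunks seen, or -1 on failure. */
long count_chunks(FILE *fp, size_t max_lines)
{
    ChunkReader r;
    long chunks = 0;
    int rc;

    if (chunk_reader_init(&r, fp, max_lines) != 0)
        return -1;
    while ((rc = chunk_reader_next(&r)) == 1) {
        /* Here you would reset the lexer and parser onto r.buf / r.len
         * and parse this chunk. Copy out any token text you still need,
         * because the buffer is reused on the next call. */
        chunks++;
    }
    chunk_reader_free(&r);
    return rc < 0 ? -1 : chunks;
}
```

Because the buffer is reused, this sketch only suits the case where 
token text is copied or consumed before the next chunk is loaded. A 
reader that kept each chunk alive for longer would need one buffer per 
chunk instead.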

Jim