[antlr-interest] ANTLR running out of memory while parsing huge files
Jim Idle
jimi at temporal-wave.com
Tue Apr 21 07:25:48 PDT 2009
Nick Vlassopoulos wrote:
> Hi,
>
> I am fairly new to ANTLR and I have come across a problem.
> I have written a simple grammar to parse huge data files (several
> gigabytes each) and antlr seems to crash by running out of memory
> (I am using "C" as the target language).
>
> The data files have the general format:
> HEADER
> DECL
> BODY
> <several millions of lines here>
> END
>
> What seems to be the problem is that antlr tries to parse the whole
> data file at once. Is there a way to "force" parsing line by line?
> (at least for the "BODY" part?)
>
You will need to split the input into more manageable chunks yourself, I
am afraid. When you start the parser, it asks the lexer for the first
token, which causes the lexer to tokenize the entire input.

You can feed the input line by line by resetting the lexer and parser
and providing a new string stream, with the pointer and length set
accordingly, and hence a new token stream for the chunk you wish to
parse next. There is a relatively small overhead in doing this from C,
and it is the same technique you would use to parse any chunk. If your
input is several gigabytes, then the standard technique of reading the
whole file and parsing it all at once would not be much use anyway.

In your position I would write a custom input stream that performs
buffered reads on the file and returns EOF at strategic points, but
which can be reset (or maybe auto-reset) until the real EOF is found.
Your parser can retain state so you know where you are. At each EOF, you
can ask the input stream whether it was really the end or just a fake
end, in which case you restart from there.

Make sure that you retain the input stream for as long as you need to
actualize the text of the tokens, as the tokens just point into the
input stream. However, you can avoid that dependency by setting the text
explicitly or by building up your output on the fly and so on.
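
As a rough illustration of the "fake EOF" idea, here is a minimal sketch
in plain C that makes no ANTLR calls at all. ChunkReader,
chunk_reader_init and chunk_reader_next are hypothetical names invented
for this example; they are not part of the ANTLR C runtime. The sketch
assumes the BODY is line oriented, so a chunk can safely end at any
newline, and it uses ftell/fseek, so it assumes a seekable file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, NOT part of the ANTLR C runtime. */
typedef struct {
    FILE  *fp;        /* underlying data file (must be seekable)   */
    char  *buf;       /* caller-supplied chunk buffer              */
    size_t cap;       /* capacity of buf in bytes                  */
    size_t len;       /* bytes of valid data currently in buf      */
    int    real_eof;  /* nonzero once the file itself is exhausted */
} ChunkReader;

static void chunk_reader_init(ChunkReader *r, FILE *fp,
                              char *buf, size_t cap)
{
    assert(r != NULL && fp != NULL && buf != NULL && cap > 1);
    r->fp = fp;
    r->buf = buf;
    r->cap = cap;
    r->len = 0;
    r->real_eof = 0;
}

/*
 * Refill buf with as many whole lines as fit and return the number of
 * bytes stored. The end of each chunk is a "fake EOF"; r->real_eof is
 * set only when the file has nothing more to give (a read error is
 * treated the same way). A single line longer than the buffer is
 * split across chunks rather than rejected.
 */
static size_t chunk_reader_next(ChunkReader *r)
{
    r->len = 0;
    while (r->cap - r->len > 1) {
        long   mark = ftell(r->fp);   /* start of the line we try next */
        char  *dst  = r->buf + r->len;
        size_t n;

        if (fgets(dst, (int)(r->cap - r->len), r->fp) == NULL) {
            r->real_eof = 1;
            break;
        }
        n = strlen(dst);
        if (dst[n - 1] != '\n' && r->len > 0 && !feof(r->fp)) {
            /* The line did not fit: push it back for the next chunk. */
            fseek(r->fp, mark, SEEK_SET);
            *dst = '\0';
            break;
        }
        r->len += n;
    }
    return r->len;
}
```

After each call that returns a nonzero length, you would wrap buf and
len in a fresh string input stream, reset the lexer and parser, and
parse that chunk; check real_eof afterwards to see whether the end was
real or fake. Note that the buffer is overwritten by the next call, so
any token text you still need must be copied out first.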
Jim