[antlr-interest] ANTLR running out of memory while parsing huge files

Jim Idle jimi at temporal-wave.com
Tue Apr 21 08:11:08 PDT 2009


Nick Vlassopoulos wrote:
> Hi Jim!
>
> Thanks for your replies!!
>
> The input lines are of the form
> "var = data"
> so they are pretty simple!
> If I got this right, you suggest using something like a
> body_set :
>    body_start (probably a "greedy" option here?) body_end
> rule and then just add code to parse the intermediate lines (which are 
> pretty simple) manually??
Actually, do you need a parser at all? Perhaps you can do this entirely in
the lexer: rather than creating tokens for the data, just read the input
stream directly in your own lexer action code.
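
A minimal sketch of what that action code might look like, assuming the
C runtime's input stream exposes the usual _LA/consume function pointers
through its istream member; handle_assignment and the buffer size are
made up for illustration:

    #include <antlr3.h>
    #include <string.h>

    /* Hypothetical callback - do whatever you like with the pair.
     * Whitespace trimming around the '=' is left out of the sketch. */
    extern void handle_assignment(const char *var, const char *val);

    /* Consume one "var = data" line straight off the input stream,
     * without ever creating a token for it.                         */
    static void
    scanDataLine(pANTLR3_INPUT_STREAM input)
    {
        char          line[512];
        size_t        n = 0;
        ANTLR3_UINT32 c;

        while ((c = input->istream->_LA(input->istream, 1)) != ANTLR3_CHARSTREAM_EOF
                && c != '\n')
        {
            if (n < sizeof(line) - 1)
            {
                line[n++] = (char)c;
            }
            input->istream->consume(input->istream);
        }
        if (c == '\n')
        {
            input->istream->consume(input->istream);  /* eat the newline */
        }
        line[n] = '\0';

        /* Split at '=' and hand the pieces to your own code. */
        {
            char *eq = strchr(line, '=');
            if (eq != NULL)
            {
                *eq = '\0';
                handle_assignment(line, eq + 1);
            }
        }
    }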

But I was thinking this (a rough C sketch of the loop follows the list):

1) Copy my input stream code and rename it for yourself;
2) Have it respond to LA() using buffered reads until it reaches the token
   that starts the body, say it is 'BODY', at which point it returns EOF;
3) Invoke the parser/lexer/input stream stack; it will set up the
   information you need for the incoming data and then stop, and the input
   stream remembers where it was;
4) Process the data with a little custom C code that works directly on the
   input stream; when you see that the data has ended, tell the input
   stream where to restart;
5) Tell the input stream to set up for the next header, starting at the
   data end location; if it was not at the real EOF, go to 3);
6) End.
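
Steps 3) to 6) boil down to a driver loop. A rough sketch, assuming a
grammar named Hdr with a top-level rule "headers"; myInputStreamNew,
scanDataSection and myInputRestartAt stand in for the custom pieces from
steps 1), 4) and 5) and are not real runtime calls:

    #include <antlr3.h>
    #include "HdrLexer.h"    /* generated for a grammar named "Hdr" */
    #include "HdrParser.h"

    /* Hypothetical pieces described in the steps above. */
    extern pANTLR3_INPUT_STREAM myInputStreamNew (pANTLR3_UINT8 fname);
    extern ANTLR3_UINT32        scanDataSection (pANTLR3_INPUT_STREAM input);
    extern ANTLR3_BOOLEAN       myInputRestartAt(pANTLR3_INPUT_STREAM input,
                                                 ANTLR3_UINT32 dataEnd);

    int
    main(int argc, char *argv[])
    {
        pANTLR3_INPUT_STREAM input = myInputStreamNew((pANTLR3_UINT8)argv[1]);

        for (;;)
        {
            /* Step 3: run the normal stack; the custom stream fakes
             * EOF at 'BODY', so the parser stops after the header.  */
            pHdrLexer                   lex    = HdrLexerNew(input);
            pANTLR3_COMMON_TOKEN_STREAM tokens =
                antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
                                                 TOKENSOURCE(lex));
            pHdrParser                  parser = HdrParserNew(tokens);

            parser->headers(parser);    /* your top-level header rule */

            parser->free(parser);
            tokens->free(tokens);
            lex->free(lex);

            /* Step 4: plain C scan of the data section. */
            ANTLR3_UINT32 dataEnd = scanDataSection(input);

            /* Steps 5 and 6: reposition for the next header, or stop. */
            if (myInputRestartAt(input, dataEnd) == ANTLR3_FALSE)
            {
                break;                  /* real EOF reached */
            }
        }
        input->close(input);
        return 0;
    }

The lexer/token stream/parser construction there is just the stock C
target boilerplate; everything else is your own code.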

It sounds more complicated written in an email than it will be in the C
code ;-) You can also do the same thing without a custom input stream, but
then you would have to read the entire file and pre-scan it, and so on.

If your headers are pretty simple, you might also find that an awk script
or just plain C code is a better method ;-)
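
For scale, the plain C version of the "var = data" scan might amount to
no more than this (the field widths here are arbitrary):

    #include <stdio.h>

    /* Read "var = data" lines from stdin, no ANTLR involved. */
    int
    main(void)
    {
        char line[512];
        char var[64];
        char val[64];

        while (fgets(line, sizeof(line), stdin) != NULL)
        {
            /* %63[^ =] = name up to ' ' or '=', %63[^\n] = the rest. */
            if (sscanf(line, " %63[^ =] = %63[^\n]", var, val) == 2)
            {
                printf("var=%s value=%s\n", var, val);
            }
        }
        return 0;
    }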

Jim

