[antlr-interest] parsing a very large file

Gavin Lambert antlr at mirality.co.nz
Thu Mar 26 14:06:11 PDT 2009


At 05:10 27/03/2009, Vladimir Konrad wrote:
 >
 >I have read in the ANTLR book (from the Pragmatic Bookshelf) that
 >ANTLR always loads the entire file/stream into memory. Is this
 >still the case?

Yes, by default.

 >I would need to load a data file which is quite large (100MB+),
 >but parsing it with ANTLR uses over 1GB of RAM. Is there any way
 >to use ANTLR to load such a data file without such large memory
 >consumption?  If not, what other (Java) options are there?

Have a look at the Wiki.  I believe there's some info there about 
overriding the token stream so that it doesn't preload all the 
tokens; that will keep the initial memory footprint down.
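
For what it's worth, later ANTLR 3.x releases ship an 
org.antlr.runtime.UnbufferedTokenStream that does roughly this out 
of the box; if your version has it, the wiring looks something like 
the sketch below (MyLexer, MyParser, and the "file" start rule are 
placeholders for your own grammar).  Note that ANTLRFileStream 
still reads the whole file into a char buffer; this only avoids 
buffering all the token objects on top of it.

  import org.antlr.runtime.ANTLRFileStream;
  import org.antlr.runtime.TokenStream;
  import org.antlr.runtime.UnbufferedTokenStream;

  public class BigFileParse {
      public static void main(String[] args) throws Exception {
          // The char stream is still fully in memory...
          ANTLRFileStream input = new ANTLRFileStream("big-data.txt");
          MyLexer lexer = new MyLexer(input);
          // ...but the token stream keeps only a small lookahead
          // window instead of materialising every token up front.
          TokenStream tokens = new UnbufferedTokenStream(lexer);
          MyParser parser = new MyParser(tokens);
          parser.file();  // whatever your start rule is
      }
  }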

However, I don't think there's any way to avoid having everything 
loaded at once at some point: the tokens don't actually copy the 
data, they just hold references to their positions in the input 
stream (and even if they did copy it, that would take up just as 
much memory).  You certainly can't produce an AST without having 
the entire input file in memory and tokenised.
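
To make that concrete: in ANTLR 3, a CommonToken mostly just 
records the start/stop character indexes of its text within the 
CharStream, and getText() reads the substring back out of that 
buffer on demand, so the buffer has to stay alive for as long as 
anything might ask a token (or tree node) for its text.  A rough 
illustration (MyLexer is a placeholder for a generated lexer):

  import org.antlr.runtime.ANTLRStringStream;
  import org.antlr.runtime.CommonToken;
  import org.antlr.runtime.Token;

  public class TokenPeek {
      public static void main(String[] args) {
          ANTLRStringStream input = new ANTLRStringStream("some input");
          MyLexer lexer = new MyLexer(input);
          Token t = lexer.nextToken();
          // Positions into the shared char buffer, not a private copy:
          int start = ((CommonToken) t).getStartIndex();
          int stop  = ((CommonToken) t).getStopIndex();
          String text = t.getText();  // fetched from the buffer on demand
      }
  }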

Is there some way you can split up the input outside of ANTLR 
first?  I've used ANTLR before to parse some large data files 
(~80MB), but those were archive files containing chunks of roughly 
500KB each that could be parsed independently of one another, so it 
was fairly straightforward.
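
For example, if your data file is really a sequence of independent 
records (here I'm assuming blank-line-separated records; adjust the 
framing to whatever your format actually uses), you can slice it up 
with plain java.io and hand each piece to ANTLR on its own, so only 
one record's tokens exist at a time.  MyLexer, MyParser, and the 
"record" rule are again placeholders for your grammar:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import org.antlr.runtime.ANTLRStringStream;
  import org.antlr.runtime.CommonTokenStream;

  public class ChunkedParse {
      public static void main(String[] args) throws Exception {
          BufferedReader in = new BufferedReader(new FileReader("big-data.txt"));
          StringBuilder chunk = new StringBuilder();
          String line;
          while ((line = in.readLine()) != null) {
              if (line.length() == 0) {  // blank line ends a record
                  if (chunk.length() > 0) parseChunk(chunk.toString());
                  chunk.setLength(0);    // reuse the buffer
              } else {
                  chunk.append(line).append('\n');
              }
          }
          if (chunk.length() > 0) parseChunk(chunk.toString());
          in.close();
      }

      static void parseChunk(String text) throws Exception {
          // Each chunk gets its own lexer/parser, so only one
          // chunk's tokens are in memory at a time.
          MyLexer lexer = new MyLexer(new ANTLRStringStream(text));
          MyParser parser = new MyParser(new CommonTokenStream(lexer));
          parser.record();  // hypothetical start rule for one record
      }
  }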


