[antlr-interest] Parsing Very Large Inputs
Randall R Schulz
rschulz at sonic.net
Sun Nov 26 11:05:40 PST 2006
Hi,
I'm using ANTLR v3 (3.0b5).
If I understand correctly, the Token sequence produced by an ANTLR
lexical analyzer is retained for the entire parse by CommonTokenStream.
Ordinarily this is fine, but when parsing very large inputs the token
buffer places an excessive demand on memory and becomes the limiting
factor for the size of inputs that can be processed without allocating
inordinate amounts of RAM.
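For concreteness, my setup is the standard one, roughly like the
following sketch (MyLexer, MyParser and compilationUnit stand in for
my generated classes and top-level rule):

    import org.antlr.runtime.ANTLRFileStream;
    import org.antlr.runtime.CharStream;
    import org.antlr.runtime.CommonTokenStream;

    public class ParseHuge {
        public static void main(String[] args) throws Exception {
            CharStream input = new ANTLRFileStream(args[0]);
            MyLexer lexer = new MyLexer(input);
            // CommonTokenStream buffers every Token the lexer produces
            // and holds onto all of them for the duration of the parse.
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            MyParser parser = new MyParser(tokens);
            parser.compilationUnit();
        }
    }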
It appears that TokenRewriteStream would allow me to discard old Token
instances once they're no longer needed. In my parser, I do this in
conjunction with collecting any comment Tokens that may appear between
top-level constructs in the language.
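The comment collection amounts to roughly this (COMMENT is my comment
token type; prevStop and curStart are the token indexes bounding the
gap between two successive top-level constructs):

    // Harvest any comment Tokens lying in the gap between the previous
    // top-level construct and the current one. getTokens() scans the
    // buffered token list, so it sees hidden-channel tokens too.
    List comments = tokens.getTokens(prevStop + 1, curStart - 1, COMMENT);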
So I switched to using a TokenRewriteStream and then invoked
TokenRewriteStream.delete(0, newFirstTokenIndex) after parsing every
top-level construct (where newFirstTokenIndex is the index of the first
token in the top-level construct).
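In outline (simplified from my actual code):

    TokenRewriteStream tokens = new TokenRewriteStream(lexer);
    MyParser parser = new MyParser(tokens);

    // ... then, after parsing each top-level construct, where
    // newFirstTokenIndex is the index of the construct's first token:
    tokens.delete(0, newFirstTokenIndex);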
However, this does not seem to have the effect on RAM consumption I'd
hoped for. The JavaDoc comment on TokenRewriteStream says the
manipulations it performs are carried out "lazily," so I added a call
to TokenRewriteStream.toString(0, 1) after the delete(...) call to
force them. When I print the resulting string, it shows the text of
the token at newFirstTokenIndex, which seems correct.
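That check looks roughly like this:

    tokens.delete(0, newFirstTokenIndex);
    // Force the lazy rewrite machinery to run and show what now
    // appears at the front of the stream.
    System.out.println(tokens.toString(0, 1));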
With this modification, the parse continues normally, but memory use
does not appear to be significantly reduced. I also noticed that the
token indexes of the left-hand token of successive top-level
constructs keep increasing as if no Token deletion had been performed,
though I'm guessing that's intended.
Apparently, the Token structures themselves remain and only the text
to which they refer is discarded. It also seems, based on the comments
for TokenRewriteStream, that a lot of bookkeeping is put in place to
record the manipulations.
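For what it's worth, the effect I'm after is that of the following
sketch. It is entirely untested; it pokes at CommonTokenStream's
protected tokens list, and it assumes the parser never looks back at
the discarded indexes:

    import org.antlr.runtime.CommonTokenStream;
    import org.antlr.runtime.TokenSource;

    public class DiscardingTokenStream extends CommonTokenStream {
        public DiscardingTokenStream(TokenSource source) {
            super(source);
        }

        // Drop the buffer's references to Tokens before firstLiveIndex
        // so the garbage collector can reclaim them. Anything that
        // later rewinds or re-reads those indexes would see null.
        public void discardTokensBefore(int firstLiveIndex) {
            for (int i = 0; i < firstLiveIndex && i < tokens.size(); i++) {
                tokens.set(i, null);
            }
        }
    }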
Is there a simpler way to completely discard old Tokens?
Randall Schulz