[antlr-interest] Parsing Very Large Inputs
Randall R Schulz
rschulz at sonic.net
Sun Nov 26 11:05:40 PST 2006
Hi,
I'm using ANTLR v3 (3.0b5).
If I understand correctly, the Token sequence produced by an ANTLR
lexical analyzer is retained for the entire parse by CommonTokenStream.
Ordinarily this is fine, but when parsing very large inputs the token
buffer places an excessive demand on memory and becomes the limiting
factor for the size of inputs that can be processed without allocating
inordinate amounts of RAM.
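For concreteness, my setup is the standard one, roughly like the
following sketch (MyLexer, MyParser and compilationUnit stand in for
my generated classes and top-level rule):

    import org.antlr.runtime.ANTLRFileStream;
    import org.antlr.runtime.CharStream;
    import org.antlr.runtime.CommonTokenStream;

    public class ParseHuge {
        public static void main(String[] args) throws Exception {
            CharStream input = new ANTLRFileStream(args[0]);
            MyLexer lexer = new MyLexer(input);
            // CommonTokenStream buffers every Token the lexer produces
            // and holds onto all of them for the duration of the parse.
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            MyParser parser = new MyParser(tokens);
            parser.compilationUnit();
        }
    }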
It appears that TokenRewriteStream would allow me to discard old Token
instances once they're no longer needed. In my parser, I do this in
conjunction with collecting any comment Tokens that may appear between
top-level constructs in the language.
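The comment collection amounts to roughly this (COMMENT is my comment
token type; prevStop and curStart are the token indexes bounding the
gap between two successive top-level constructs):

    // Harvest any comment Tokens lying in the gap between the previous
    // top-level construct and the current one. getTokens() scans the
    // buffered token list, so it sees hidden-channel tokens too.
    List comments = tokens.getTokens(prevStop + 1, curStart - 1, COMMENT);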
So I switched to using a TokenRewriteStream and then invoked
TokenRewriteStream.delete(0, newFirstTokenIndex) after parsing every
top-level construct (where newFirstTokenIndex is the index of the first
token in the top-level construct).
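In outline (simplified from my actual code):

    TokenRewriteStream tokens = new TokenRewriteStream(lexer);
    MyParser parser = new MyParser(tokens);

    // ... then, after parsing each top-level construct, where
    // newFirstTokenIndex is the index of the construct's first token:
    tokens.delete(0, newFirstTokenIndex);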
However, this does not seem to have the effect on RAM consumption I'd
hoped for. The JavaDoc comment on TokenRewriteStream says the
manipulations it performs are carried out "lazily," so I added a call
to TokenRewriteStream.toString(0, 1) after the delete(...) call to
force them. When I print the resulting string, it shows the text of
the token at newFirstTokenIndex, which seems correct.
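That check looks roughly like this:

    tokens.delete(0, newFirstTokenIndex);
    // Force the lazy rewrite machinery to run and show what now
    // appears at the front of the stream.
    System.out.println(tokens.toString(0, 1));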
With this modification, the parse continues normally, but memory use
does not appear to be significantly reduced. I also noticed that the
token indexes of the left-hand token of successive top-level
constructs keep increasing as if no Token deletion had been performed,
though I'm guessing that's intended.
Apparently, the Token structures themselves remain and only the text
to which they refer is discarded. It also seems, based on the comments
for TokenRewriteStream, that a lot of bookkeeping is put in place to
record the manipulations.
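For what it's worth, the effect I'm after is that of the following
sketch. It is entirely untested; it pokes at CommonTokenStream's
protected tokens list, and it assumes the parser never looks back at
the discarded indexes:

    import org.antlr.runtime.CommonTokenStream;
    import org.antlr.runtime.TokenSource;

    public class DiscardingTokenStream extends CommonTokenStream {
        public DiscardingTokenStream(TokenSource source) {
            super(source);
        }

        // Drop the buffer's references to Tokens before firstLiveIndex
        // so the garbage collector can reclaim them. Anything that
        // later rewinds or re-reads those indexes would see null.
        public void discardTokensBefore(int firstLiveIndex) {
            for (int i = 0; i < firstLiveIndex && i < tokens.size(); i++) {
                tokens.set(i, null);
            }
        }
    }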
Is there a simpler way to completely discard old Tokens?
Randall Schulz