[antlr-interest] Parsing Very Large Inputs
Terence Parr
parrt at cs.usfca.edu
Sun Nov 26 11:48:27 PST 2006
hi. Those streams never throw tokens out. TokenRewriteStream creates
text based upon the tokens in the buffer and the operations you
perform, but it never actually deletes tokens...it just skips them
during printing. Implement the TokenStream interface and just don't
keep the tokens around. I think it's just nextToken() and LT() or
something.
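Roughly, a sketch of that idea (Token and TokenSource are stubbed
locally here so it compiles standalone, rather than the real
org.antlr.runtime types, and SlidingWindowTokenStream is a made-up
name; the real TokenStream interface also requires mark()/rewind()/
seek(), which get trickier once tokens are discarded):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal stand-ins for the ANTLR v3 Token/TokenSource types so this
// sketch is self-contained. In a real parser you would implement
// org.antlr.runtime.TokenStream over org.antlr.runtime.TokenSource.
interface Token { String getText(); int getType(); }
interface TokenSource { Token nextToken(); }

// Keeps only the current lookahead window alive instead of buffering
// every token for the whole parse, as CommonTokenStream does.
class SlidingWindowTokenStream {
    private final TokenSource source;
    private final Deque<Token> lookahead = new ArrayDeque<Token>();

    SlidingWindowTokenStream(TokenSource source) { this.source = source; }

    // LT(k): return the k-th lookahead token (1-based), pulling from
    // the lexer on demand; nothing behind the current position is kept.
    Token LT(int k) {
        while (lookahead.size() < k) {
            lookahead.addLast(source.nextToken());
        }
        int i = 1;
        for (Token t : lookahead) {
            if (i++ == k) return t;
        }
        throw new IllegalArgumentException("k must be >= 1");
    }

    // consume(): drop the current token. No reference to it survives,
    // so the garbage collector can reclaim it immediately.
    void consume() {
        LT(1);                   // make sure a current token exists
        lookahead.removeFirst(); // discard it
    }
}
```

The trade-off: since consumed tokens are gone, you cannot rewind past
a discarded region, so this only works for parses that don't backtrack
over it.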
Ter
On Nov 26, 2006, at 11:05 AM, Randall R Schulz wrote:
> Hi,
>
> I'm using ANTLR v3 (3.0b5).
>
> If I understand correctly, the Token sequence produced by an ANTLR
> lexical analyzer is retained throughout the parse by
> CommonTokenStream.
> Ordinarily, this is fine, but when parsing very large inputs, it can
> place an excessive demand on primary storage and become a limiting
> factor for the size of inputs that can be processed without the
> allocation of inordinate amounts of RAM.
>
> It appears that TokenRewriteStream would allow me to discard old Token
> instances once they're no longer needed. In my parser, I do this in
> conjunction with collecting any comment Tokens that may appear between
> top-level constructs in the language.
>
> So I switched to using a TokenRewriteStream and then invoked
> TokenRewriteStream.delete(0, newFirstTokenIndex) after parsing every
> top-level construct (where newFirstTokenIndex is the index of the
> first token in the top-level construct).
>
> However, this does not seem to have the effect on RAM consumption I'd
> hoped. The JavaDoc comment on TokenRewriteStream says the
> manipulations it performs are carried out "lazily," so I added a call
> to TokenRewriteStream.toString(0, 1) after the delete(...) call. When
> I print this string it shows the text of the token at
> newFirstTokenIndex, which seems correct.
>
> With this modification, the parse continues normally, but memory use
> does not appear to be significantly reduced. I noticed that the token
> indexes associated with the left-hand token of successive top-level
> constructs increase as if no Token deletion had been performed,
> though I'm guessing that's as intended.
>
> Apparently, the Token structures remain and only the text to which
> they refer is discarded. It also seems, based on the comments for
> TokenRewriteStream, that a lot of bookkeeping is put in place to
> record the manipulations.
>
> Is there a simpler way to completely discard old Tokens?
>
>
> Randall Schulz