[antlr-interest] Parsing Very Large Inputs
Terence Parr
parrt at cs.usfca.edu
Sun Nov 26 11:48:27 PST 2006
hi. Those streams never throw tokens out. TokenRewriteStream creates
text based upon the tokens in the buffer and the operations you
perform, but it never actually deletes tokens...it just skips them
during printing. Implement the TokenStream interface and just don't
keep the tokens around. I think it's just nextToken() and LT() or
something.
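Roughly, a sketch of that idea (Token and TokenSource are stubbed
locally here so it compiles standalone, rather than the real
org.antlr.runtime types, and SlidingWindowTokenStream is a made-up
name; the real TokenStream interface also requires mark()/rewind()/
seek(), which get trickier once tokens are discarded):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal stand-ins for the ANTLR v3 Token/TokenSource types so this
// sketch is self-contained. In a real parser you would implement
// org.antlr.runtime.TokenStream over org.antlr.runtime.TokenSource.
interface Token { String getText(); int getType(); }
interface TokenSource { Token nextToken(); }

// Keeps only the current lookahead window alive instead of buffering
// every token for the whole parse, as CommonTokenStream does.
class SlidingWindowTokenStream {
    private final TokenSource source;
    private final Deque<Token> lookahead = new ArrayDeque<Token>();

    SlidingWindowTokenStream(TokenSource source) { this.source = source; }

    // LT(k): return the k-th lookahead token (1-based), pulling from
    // the lexer on demand; nothing behind the current position is kept.
    Token LT(int k) {
        while (lookahead.size() < k) {
            lookahead.addLast(source.nextToken());
        }
        int i = 1;
        for (Token t : lookahead) {
            if (i++ == k) return t;
        }
        throw new IllegalArgumentException("k must be >= 1");
    }

    // consume(): drop the current token. No reference to it survives,
    // so the garbage collector can reclaim it immediately.
    void consume() {
        LT(1);                   // make sure a current token exists
        lookahead.removeFirst(); // discard it
    }
}
```

The trade-off: since consumed tokens are gone, you cannot rewind past
a discarded region, so this only works for parses that don't backtrack
over it.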
Ter
On Nov 26, 2006, at 11:05 AM, Randall R Schulz wrote:
> Hi,
>
> I'm using ANTLR v3 (3.0b5).
>
> If I understand correctly, the Token sequence produced by an ANTLR
> lexical analyzer is retained throughout the parse by
> CommonTokenStream.
> Ordinarily, this is fine, but when parsing very large inputs, it can
> place an excessive demand on primary storage and become a limiting
> factor for the size of inputs that can be processed without the
> allocation of inordinate amounts of RAM.
>
> It appears that TokenRewriteStream would allow me to discard old Token
> instances once they're no longer needed. In my parser, I do this in
> conjunction with collecting any comment Tokens that may appear between
> top-level constructs in the language.
>
> So I switched to using a TokenRewriteStream and then invoked
> TokenRewriteStream.delete(0, newFirstTokenIndex) after parsing every
> top-level construct (where newFirstTokenIndex is the index of the
> first token in the top-level construct).
>
> However, this does not seem to have the effect on RAM consumption I'd
> hoped. The JavaDoc comment on TokenRewriteStream says the
> manipulations it performs are carried out "lazily," so I added a call
> to TokenRewriteStream.toString(0, 1) after the delete(...) call. When
> I print this string it shows the text of the token at
> newFirstTokenIndex, which seems correct.
>
> With this modification, the parse continues normally, but memory use
> does not appear to be significantly reduced. I noticed that the token
> indexes associated with the left-hand token of successive top-level
> constructs increase as if no Token deletion had been performed,
> though I'm guessing that's as intended.
>
> Apparently, the Token structures remain and only the text to which
> they refer is discarded. It also seems, based on the comments for
> TokenRewriteStream, that a lot of bookkeeping is put in place to
> record the manipulations.
>
> Is there a simpler way to completely discard old Tokens?
>
>
> Randall Schulz