[antlr-interest] Bounding the token stream in the C backend

Christopher L Conway cconway at cs.nyu.edu
Tue Mar 2 18:44:13 PST 2010


On Thu, Feb 25, 2010 at 12:09 AM, Christopher L Conway
<cconway at cs.nyu.edu> wrote:
> I've got a large input file (~39MB) that I'm attempting to parse with
> an ANTLR3-generated C parser. The parser is using a huge amount of
> memory (~3.7GB) and seems to start thrashing without making much
> progress towards termination. I found a thread from earlier this month
> (http://markmail.org/message/jfngdd2ci6h7qrbo) suggesting the most
> likely cause of such behavior is a parser bug, but I've stepped
> through the code and it seems to be lexing just fine. Rather, it seems
> the problem is that fillBuffer() is tokenizing the whole file in one
> go; then, the parsing rules slow to a crawl because the token buffer
> is sitting on all the memory.
>
> I wonder if there is a way to change fillBuffer()'s behavior, so that
> it will only lex some bounded number of tokens before allowing parsing
> to proceed?

I have a partial solution to this problem. To be clear, the issue is:

1. The default token stream implementation tokenizes the entire input
on the first call to LT().
2. The default token factory never de-allocates tokens.

Since a token structure is more than 100 bytes, large inputs can
easily consume multiple GB before parsing even begins. (This is in the
C back-end. I don't know about other back-ends.)
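For a rough sense of scale (the average token length here is only a guess): if the 39MB input averages, say, four characters per token, that's on the order of 10 million tokens; at 100+ bytes per token structure, before even counting the per-token text copies, the fully buffered token stream costs a gigabyte or more before the first parser rule runs.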

The solution consists of:

1. A token stream implementation that tokenizes up to a fixed lookahead limit.
2. A token factory that pre-allocates a fixed number of tokens,
recycling old tokens when new ones are requested.

This seems to be a sound strategy, so long as the input grammar has a
known lookahead limit and the allocation pool is sufficiently large.
My grammar is LL(2), and a lookahead limit of 2 with a token pool of 8
tokens works just fine.
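To make the shape of the solution concrete, here is a minimal, self-contained
sketch of the idea in C. It deliberately does not use the real ANTLR3 C runtime
types or API; the struct names, the toy one-character-per-token lexer, and the
pool/lookahead sizes are invented purely for illustration. The core trick is
the same, though: LT(k) lexes at most k tokens ahead, and token storage is a
fixed ring that is recycled on consume() instead of growing with the input.

#include <stdio.h>

#define POOL_SIZE 8      /* fixed number of pre-allocated tokens */
#define LOOKAHEAD 2      /* the example grammar is LL(2)         */

typedef struct Token {
    int  type;
    char text[32];       /* token text copied into a fixed buffer */
} Token;

typedef struct BoundedStream {
    Token pool[POOL_SIZE];  /* recycled storage: never freed, never grown */
    long  next;             /* ring index of LT(1)                        */
    long  filled;           /* number of tokens currently buffered        */
    const char *input;      /* toy "lexer" input: one token per character */
    size_t pos;
} BoundedStream;

/* Toy lexer: every non-space character becomes one token. */
static int lexNext(BoundedStream *s, Token *out) {
    while (s->input[s->pos] == ' ') s->pos++;
    if (s->input[s->pos] == '\0') return 0;      /* EOF */
    out->type    = (int) s->input[s->pos];
    out->text[0] = s->input[s->pos];
    out->text[1] = '\0';
    s->pos++;
    return 1;
}

/* LT(k): lex only as far as needed, reusing pool slots in a ring. */
static Token *LT(BoundedStream *s, int k) {
    if (k < 1 || k > LOOKAHEAD || k > POOL_SIZE) return NULL;
    while (s->filled < k) {
        Token *slot = &s->pool[(s->next + s->filled) % POOL_SIZE];
        if (!lexNext(s, slot)) return NULL;      /* EOF before LT(k) */
        s->filled++;
    }
    return &s->pool[(s->next + k - 1) % POOL_SIZE];
}

/* consume(): drop LT(1); its slot becomes free for recycling. */
static void consume(BoundedStream *s) {
    if (s->filled > 0) {
        s->next = (s->next + 1) % POOL_SIZE;
        s->filled--;
    }
}

int main(void) {
    BoundedStream s = { .next = 0, .filled = 0,
                        .input = "a b c d e f", .pos = 0 };
    Token *t;
    while ((t = LT(&s, 1)) != NULL) {
        Token *ahead = LT(&s, 2);
        printf("LT(1)=%s  LT(2)=%s\n", t->text,
               ahead ? ahead->text : "<EOF>");
        consume(&s);
    }
    return 0;
}

The pool only needs to be as large as the lookahead plus however many consumed
tokens the parser still holds references to (e.g. for tree construction), which
is presumably why a pool of 8 leaves comfortable headroom over LL(2).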

Using this implementation, I'm able to parse the above-mentioned 39MB
input file in less than 700MB of memory, a more than 5-fold
improvement on the defaults (as an added benefit, the parser actually
terminated after 45s and didn't completely lock up my system!). The
parser is about 40% faster than an equivalent ANTLR 2.7 parser using
the C++ back-end, but still uses about 5 times as much memory. The
remaining excess seems to be due to the default string factory
implementation, which likewise never releases memory once it has been
allocated. That is a much more complex beast and I'm hesitant to
tackle it.

If anybody is interested in using this code, I'm willing to clean it
up and share it with the community. Please feel free to contact me.

Regards,
Chris

