[antlr-interest] Bounding the token stream in the C backend

Jim Idle jimi at temporal-wave.com
Tue Mar 2 23:36:49 PST 2010


Your points are well taken, but there are some misunderstandings here. I have tried to say on a number of occasions that the string factory output you get from $xyz.text is only a convenience and should not be used in any environment that requires performance or memory efficiency. It is just an easy way to get at the underlying pointers.

However, look guys, this is C!! By which I mean that for real efficiency you should be accessing things such as the text of a token via the pointers in the token itself, not via the artifice of $text. In the next release I will document this better, and I apologize for not having done so until now. There are also lots of macros and switches you can set that will improve performance considerably, and the upcoming release has many performance improvements. For comparison, I am currently working on a parser for IBM that is 7x faster than the 2.7.x C++ equivalent. Once again, I apologize for not documenting all of this as well as it might be; the code itself is well documented, but there need to be more usage docs I think.
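To make that concrete, here is a minimal sketch of reading a token's text straight from the input buffer instead of going through $text. It assumes the usual 3.x C runtime setup where, for the default 8-bit input stream, a token's start and stop markers are addresses into the original input; the names follow the runtime headers (antlr3commontoken.h), but treat this as an illustration rather than documented API.

    /* Sketch only: read token text directly from the input buffer.
     * Assumes the default 8-bit input stream, where the start/stop
     * markers are addresses into the original input. No pANTLR3_STRING
     * is created, so the string factory never allocates.
     */
    #include <stdio.h>
    #include <antlr3.h>

    static void printTokenText(pANTLR3_COMMON_TOKEN tok)
    {
        pANTLR3_UINT8 start = (pANTLR3_UINT8) tok->getStartIndex(tok);
        pANTLR3_UINT8 stop  = (pANTLR3_UINT8) tok->getStopIndex(tok);

        printf("%.*s\n", (int) (stop - start + 1), (const char *) start);
    }

Inside a grammar action the same idea applies: use the token pointer ($xyz) and its markers rather than $xyz.text, and nothing goes through the string factory.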

Thanks for the feedback,

jim

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Christopher L Conway
Sent: Tuesday, March 02, 2010 6:44 PM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Bounding the token stream in the C backend

On Thu, Feb 25, 2010 at 12:09 AM, Christopher L Conway
<cconway at cs.nyu.edu> wrote:
> I've got a large input file (~39MB) that I'm attempting to parse with
> an ANTLR3-generated C parser. The parser is using a huge amount of
> memory (~3.7GB) and seems to start thrashing without making much
> progress towards termination. I found a thread from earlier this month
> (http://markmail.org/message/jfngdd2ci6h7qrbo) suggesting the most
> likely cause of such behavior is a parser bug, but I've stepped
> through the code and it seems to be lexing just fine. Rather, it seems
> the problem is that fillBuffer() is tokenizing the whole file in one
> go; then, the parsing rules slow to a crawl because the token buffer
> is sitting on all the memory.
>
> I wonder if there is a way to change fillBuffer()'s behavior, so that
> it will only lex some bounded number of tokens before allowing parsing
> to proceed?

I have a partial solution to this problem. To be clear, the issue is:

1. The default token stream implementation tokenizes the entire input
on the first call to LT().
2. The default token factory never de-allocates tokens.

Since a token structure is more than 100 bytes, large inputs can
easily consume multiple GB before parsing even begins. (This is in the
C back-end. I don't know about other back-ends.)
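A rough back-of-the-envelope illustration (the average token length is my assumption, not a measurement of the input above): if a 39MB file averages around 4 bytes of text per token, that is on the order of 10 million tokens, and at 100+ bytes per token structure that is already over 1GB of token storage before counting the string factory or any per-token text copies.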

The solution consists of:

1. A token stream implementation that tokenizes up to a fixed lookahead limit.
2. A token factory that pre-allocates a fixed number of tokens,
recycling old tokens when new ones are requested.

This seems to be a sound strategy, so long as the input grammar has a
known lookahead limit and the allocation pool is sufficiently large.
My grammar is LL(2), and a lookahead limit of 2 with a token pool of 8
tokens works just fine.
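For anyone who wants a feel for the recycling idea before the code is shared, here is a minimal, generic sketch of a fixed-size token pool in C. It is not the implementation described above and does not plug into the ANTLR3 factory interface as written; the struct, function names, and pool size are purely illustrative.

    #include <stddef.h>
    #include <string.h>

    #define TOKEN_POOL_SIZE 8   /* LL(2) plus slack, per the figures above */

    typedef struct MyToken {
        unsigned int type;
        const char  *start;   /* first character of the token text */
        const char  *stop;    /* last character of the token text  */
        unsigned int line;
    } MyToken;

    typedef struct TokenPool {
        MyToken tokens[TOKEN_POOL_SIZE];
        size_t  next;         /* index of the slot to hand out next */
    } TokenPool;

    /* Hand out slots round-robin. A token requested TOKEN_POOL_SIZE
     * calls ago is silently reused, which is only safe if the parser
     * never looks further back (or ahead) than the pool allows.
     */
    static MyToken *tokenPoolNext(TokenPool *pool)
    {
        MyToken *tok = &pool->tokens[pool->next];
        pool->next = (pool->next + 1) % TOKEN_POOL_SIZE;
        memset(tok, 0, sizeof *tok);
        return tok;
    }

The recycling is what keeps memory flat: the pool bounds the number of live tokens regardless of input size, which is why the lookahead limit and the pool size have to be chosen together.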

Using this implementation, I'm able to parse the above-mentioned 39MB
input file using less than 700MB of memory, a more than 5-fold
improvement over the defaults (as an added benefit, the parser actually
terminated after 45s and didn't completely lock my system!). The
parser is about 40% faster than an equivalent ANTLR 2.7 parser using
the C++ back-end, but still uses about 5 times as much memory. The
remaining excess memory usage seems to be due to the default string
factory implementation, which also never seems to release memory
once allocated. This is a much more complex beast and I'm hesitant to
tackle it.

If anybody is interested in using this code, I'm willing to clean it
up and share it with the community. Please feel free to contact me.

Regards,
Chris
