[antlr-interest] Bounding the token stream in the C backend

Christopher L Conway cconway at cs.nyu.edu
Wed Mar 3 11:19:39 PST 2010


Hi, Jim,

On Wed, Mar 3, 2010 at 2:36 AM, Jim Idle <jimi at temporal-wave.com> wrote:
> Your points are well taken, but there are some misunderstandings here. I have tried to say on a number of occasions that the string factory conversion behind $xyz.text is just a convenience that should not be used in any environment that requires performance or memory efficiency. It is just a reasonable way to access the pointers and so on.
>
> However, look guys, this is C!! By which I mean, for real efficiency, you should be accessing things such as the text of the token via the pointers in the token and not via the artifice of $text.

Thanks for this tip! By replacing

    std::string id( (const char *)$IDENTIFIER.text->chars )

with

    pANTLR3_COMMON_TOKEN token = $IDENTIFIER;
    ANTLR3_MARKER start = token->getStartIndex(token);
    ANTLR3_MARKER end = token->getStopIndex(token);
    std::string id( (const char *)start, end-start+1 );

I see another 3-fold decrease in memory usage. In combination with the
bounded lookahead stream and token factory, this brings the memory
usage of my ANTLR 3 C parser roughly in line with the ANTLR 2.7 C++
version (and the ANTLR 3 parser is still ~40% faster).
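
If the same extraction recurs across actions, it can be pulled into a
small helper. This is only a sketch of the code above, assuming an
in-memory 8-bit input stream so that the start/stop markers are byte
pointers into the buffer (the helper name is mine):

    static std::string tokenText(pANTLR3_COMMON_TOKEN token)
    {
        // The markers point at the first and last byte of the token in the
        // input buffer, so the text length is stop - start + 1.
        ANTLR3_MARKER start = token->getStartIndex(token);
        ANTLR3_MARKER stop  = token->getStopIndex(token);
        return std::string( (const char *)start, (size_t)(stop - start + 1) );
    }

An action then reduces to std::string id = tokenText($IDENTIFIER);.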

> In the next release I will document this better and I apologize for not having done so up to now. There are also lots of macros and switches you can set that will improve performance a lot, and the upcoming release has lots of performance improvements. For comparison, I am currently working on a parser for IBM that is 7X faster than the 2.7.x C++ equivalent. Once again, I apologize for not documenting all of this as well as it might be, but the code itself is well documented; there just need to be more usage docs, I think.

This is intriguing. Could you point to a few of the important settings
I should be looking at?

Regards,
Chris

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Christopher L Conway
> Sent: Tuesday, March 02, 2010 6:44 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Bounding the token stream in the C backend
>
> On Thu, Feb 25, 2010 at 12:09 AM, Christopher L Conway
> <cconway at cs.nyu.edu> wrote:
>> I've got a large input file (~39MB) that I'm attempting to parse with
>> an ANTLR3-generated C parser. The parser is using a huge amount of
>> memory (~3.7GB) and seems to start thrashing without making much
>> progress towards termination. I found a thread from earlier this month
>> (http://markmail.org/message/jfngdd2ci6h7qrbo) suggesting the most
>> likely cause of such behavior is a parser bug, but I've stepped
>> through the code and it seems to be lexing just fine. Rather, it seems
>> the problem is that fillBuffer() is tokenizing the whole file in one
>> go; then, the parsing rules slow to a crawl because the token buffer
>> is sitting on all the memory.
>>
>> I wonder if there is a way to change fillBuffer()'s behavior, so that
>> it will only lex some bounded number of tokens before allowing parsing
>> to proceed?
>
> I have a partial solution to this problem. To be clear, the issue is:
>
> 1. The default token stream implementation tokenizes the entire input
> on the first call to LT().
> 2. The default token factory never de-allocates tokens.
>
> Since a token structure is more than 100 bytes, large inputs can
> easily consume multiple GB before parsing even begins: at, say, a few
> characters per token, a 39MB input yields on the order of ten million
> tokens, which is 1GB or more in token structures alone, before any
> other overhead. (This is in the C back-end; I don't know about other
> back-ends.)
>
> The solution consists of:
>
> 1. A token stream implementation that tokenizes up to a fixed lookahead limit.
> 2. A token factory that pre-allocates a fixed number of tokens,
> recycling old tokens when new ones are requested.
>
> This seems to be a sound strategy, so long as the input grammar has a
> known lookahead limit and the allocation pool is sufficiently large.
> My grammar is LL(2), and a lookahead limit of 2 with a token pool of 8
> tokens works just fine.
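>
> To make the recycling idea concrete (this is only a sketch of the
> strategy, not the actual ANTLR3 C factory interface; the token type
> and pool size here are placeholders):
>
>     #include <cstddef>
>
>     // Fixed ring of pre-allocated tokens, handed out round-robin.
>     // Safe as long as at most POOL_SIZE tokens are live at once,
>     // i.e. the parser's lookahead is bounded.
>     template <typename Token, std::size_t POOL_SIZE = 8>
>     class TokenPool {
>     public:
>         TokenPool() : next_(0) {}
>         Token *acquire()
>         {
>             Token *slot = &pool_[next_];
>             next_ = (next_ + 1) % POOL_SIZE;
>             *slot = Token();         // reset the recycled slot in place
>             return slot;
>         }
>     private:
>         Token pool_[POOL_SIZE];      // allocated once, never grown or freed
>         std::size_t next_;
>     };
>
> With the lookahead bounded at k, only a handful of tokens are ever
> live at once, so a pool slightly larger than k (8, in my case) is
> never recycled while a token is still in use.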
>
> Using this implementation, I'm able to parse the above-mentioned 39MB
> input file in less than 700MB of memory, a more than 5-fold
> improvement over the defaults (as an added benefit, the parser
> actually terminated after 45s and didn't completely lock my system!).
> The parser is about 40% faster than an equivalent ANTLR 2.7 parser
> using the C++ back-end, but still uses about 5 times as much memory.
> The remaining excess memory usage seems to come from the default
> string factory implementation, which also never seems to release
> memory once allocated. This is a much more complex beast and I'm
> hesitant to tackle it.
>
> If anybody is interested in using this code, I'm willing to clean it
> up and share it with the community. Please feel free to contact me.
>
> Regards,
> Chris
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address

