[antlr-interest] Bounding the token stream in the C backend

Wed Mar 3 11:38:18 PST 2010

> > However, look guys, this is C!! By which I mean, for real efficiency,
> > you should be accessing things such as the text of the token via the
> > pointers in the token and not via the artifice of $text.
> 
> Thanks for this tip! By replacing
> 
>     std::string id( $IDENTIFIER.text->chars )
> 
> with
> 
>     pANTLR3_COMMON_TOKEN token = $IDENTIFIER;
>     ANTLR3_MARKER start = token->getStartIndex(token);
>     ANTLR3_MARKER end = token->getStopIndex(token);
>     std::string id( (const char *)start, end-start+1 );
>

But do you really need to create the string at all? Can you not just keep the token and, if you ever need to materialize the text for something, copy it only at that point?
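As a rough illustration of deferring the copy, here is a minimal sketch. It does not use the ANTLR 3 C runtime: `TokenSpan`, `view_of`, `materialize`, and `is_keyword` are hypothetical names standing in for a token that only records where its text starts and stops in the input buffer. It assumes a C++17 compiler for `std::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a lexer token (not the ANTLR 3 C runtime type):
// it records where its text starts and stops inside the input buffer and
// owns no characters itself.
struct TokenSpan {
    const char *start;  // first character of the token
    const char *stop;   // last character of the token (inclusive)
};

// Non-owning view of the token text: no allocation and no copy.
// The view is only valid while the input buffer is alive.
inline std::string_view view_of(const TokenSpan &t) {
    assert(t.start != nullptr && t.stop >= t.start);
    return std::string_view(t.start,
                            static_cast<std::size_t>(t.stop - t.start + 1));
}

// Copy the characters only at the point where an owned string is needed.
inline std::string materialize(const TokenSpan &t) {
    return std::string(view_of(t));
}

// Example of work that never needs a copy: comparing against a keyword.
inline bool is_keyword(const TokenSpan &t, std::string_view keyword) {
    return view_of(t) == keyword;
}
```

Actions that only compare or hash the text can stay on `view_of`; `materialize` is reserved for the cases where the text has to outlive the input buffer.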

> I see another 3-fold decrease in memory usage. In combination with the
> bounded lookahead stream and token factory, this brings the memory
> usage of my ANTLR 3 C parser roughly in line with the ANTLR 2.7 C++ version
> (it's still ~40% faster).

It should be much better than that, which makes me think the overhead is in the other code you have surrounding the parser. You should try a comparison with no actions in either parser. Then again, perhaps you do not need to: once parsing is no longer a significant part of the total time, you will gain more by improving the action code.

> 
> > In the next release I will document this better and I apologize for
> > not having done so up to press. There are also lots of macros and
> > switches you can set that will improve performance a lot, and the
> > upcoming release has lots of performance improvements. For comparison,
> > I am currently working on a parser for IBM that is 7X faster than the
> > 2.7.x C++ equivalent. Once again, I apologize for not documenting all
> > of this stuff as well as it might be, but the code itself is well
> > documented; there just needs to be more usage docs I think.
> 
> This is intriguing. Could you point to a few of the important settings
> I should be looking at?

Things such as not using method calls for LA() when you know you have 8-bit or 16-bit input (you can do this now; check your generated code or the C examples), and turning off follow set stacking if you do not need fancy error messages but just wish to fail out or say "Syntax error at line 4". I also found some improvements in the runtime library, and I have implemented ->reuse() on all the objects up to the tree parsers. This means that you can let them accumulate the memory they need and then just reuse them for another parse, which eliminates all the malloc() calls; that is useful in things like servers. Look at the macro stuff in the generated code for more information.
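To make the reuse idea concrete, here is a small self-contained sketch of the general pattern: a pool that keeps whatever memory it has accumulated and simply rewinds for the next parse. This is not the runtime's ->reuse() implementation; `Tok`, `ReusablePool`, and their members are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical token record, used only for this illustration.
struct Tok {
    int type;
    std::size_t start;
    std::size_t stop;
};

// A pool that grows on demand during a parse and is rewound, not freed,
// between parses. After the first large input, later parses of similar
// size allocate nothing.
class ReusablePool {
public:
    // Hand out the next slot, growing only if every slot is in use.
    // std::deque keeps references to existing elements valid on push_back.
    Tok &next(int type, std::size_t start, std::size_t stop) {
        if (used_ == slots_.size()) {
            slots_.push_back(Tok{});
        }
        Tok &t = slots_[used_++];
        t = Tok{type, start, stop};
        return t;
    }

    // Forget the tokens of the previous parse but keep the memory.
    void reuse() { used_ = 0; }

    std::size_t live() const { return used_; }
    std::size_t capacity() const { return slots_.size(); }

private:
    std::deque<Tok> slots_;
    std::size_t used_ = 0;
};
```

In a server, one such pool per worker would be created once, filled during each request's parse, and rewound with `reuse()` before the next request.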

Jim