[antlr-interest] my brief thoughts on performance

Thu Feb 24 16:44:24 PST 2005

Howdy,

Thanks very much to everyone for their thoughts regarding the 
performance of the ANTLR-generated lexers.

First, ANTLR 2.x lexers unfortunately never had speed as a primary 
concern.  Still I apologize for how slow they have indeed become!  
Yikes!  I was taking a risk going to recursive-descent lexers; I like 
them but my original implementation was slow.

Next, I agree that much can be done to increase speed by tailoring the 
lexer methods like LA and so on to be suitable for your application.  
Even in general I think much could be done.  I looked at it last year 
when Sriram S. told me that he got a huge improvement, but I didn't see 
an obvious way to be general and do what he did.  The conclusion I have 
for 2.x is that you must rely on non-general user modifications to make 
it go fast; i.e., you have to do the tweaks, but perhaps we can 
summarize some useful tweaks.  Perhaps a quick change to the code 
generators will allow caching of LA(i).

For the future, speed is a concern.  To that end, I have taken a number 
of steps for v3.0:

o Token objects don't create strings nor will trees copy stuff out of 
tokens.

o LA(i) is much simpler as everything is assumed to be buffered up 
(unless you use a different char stream object)

o I try to use LA(i) exactly once per decision

o the DFA-based decision structure is doing one computation to pick an 
alt instead of n computations for n alts.

o i'm generating bytecodes directly (for the java target) so that I can 
access the goto instruction.  This is required for the cyclic DFAs that 
use arbitrary lookahead.

o as people have pointed out, the code generation structure of 3.0 is 
really flexible.  By saying target=JohnsTarget, it will load 
JohnsTarget StringTemplate group instead of the default Java target or 
whatever.  This can be a tweaked Java or C or whatever template file.

o i have not done this yet, but I anticipate many fewer redundant 
method calls in the lexer for simple tokens like semicolon.

Just to be clear, speed is a concern, but I also care a great deal 
about having ANTLR do most things "out of the box" including things 
like asking for the start/stop token matched by a rule etc...  This 
extra weight can be trimmed by people like Ruslan easily but I want it 
to do the common things easily.

Best regards,
Terence
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com