[antlr-interest] BENCHMARK. ANTLR. Bad results.

Fri Nov 12 16:10:12 PST 2004

On Sat, Nov 13, 2004 at 01:05:03AM +0200, Ruslan Zasukhin wrote:
> They are not good :-(

Not really surprised :(

> * We will urgently try to switch to a Lex lexer.

Going by previous benchmarks, this will probably get you a good increase
in speed.

> * but IMHO one of the main problems of ANTLR C++ is that it heavily uses
> the std::string class and does a lot of *copying* of strings when it parses
> tokens.

It does indeed copy a lot. Among other things, that is the penalty for the
! operator in the lexer. Also, a lot of simple tokens could do without text
(operators, constant keywords and such).

> * maybe another problem is in exceptions, although I like the mechanism of
> exceptions.

How much performance the exceptions cost probably depends on your grammar.

> I think that for 3.0 Rick and the community should produce their own special
> antl::string class, heavily optimized for ANTLR's tasks. This class should
> simply hold 2 pointers to the start/end of a token. The pointers must point
> directly into the original parsed text. There is no need to copy any byte of
> the parsed text. Everything must work on pointers. Tell me that I am
> wrong?! :-)

The current C prototype uses mmap, and during parsing indices into the
mmap'd buffer are used. So far it only uses numbers for tokens. I put off
the decision on how the text goes into the token: for now the text is
available in a temporary buffer and it is up to the programmer to copy it
(note: still a prototype). I'd love to make that configurable somehow (by
template switching or some other trick).
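
To make the index idea concrete, here is a minimal sketch, assuming C++17
(for std::string_view). It is not the prototype's code: Token, lex,
token_view and token_copy are hypothetical names, and the toy lexer just
splits on spaces. A token holds only a number and two indices, and the text
is copied only when the caller asks for it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical token types for this sketch.
enum : int { TOK_IDENT = 1, TOK_NUMBER = 2 };

// A token is just a type number plus a [start, end) range into the input.
struct Token {
    int type;
    std::size_t start;
    std::size_t end;
};

// Borrow the token text without copying; valid only while the input lives.
inline std::string_view token_view(std::string_view input, const Token& t) {
    assert(t.start <= t.end && t.end <= input.size());
    return input.substr(t.start, t.end - t.start);
}

// Copy the text only when the caller really needs to own it.
inline std::string token_copy(std::string_view input, const Token& t) {
    return std::string(token_view(input, t));
}

// Toy lexer: splits on spaces and records indices only, never text.
inline std::vector<Token> lex(std::string_view input) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < input.size()) {
        if (input[i] == ' ') { ++i; continue; }
        const std::size_t start = i;
        bool digits = true;
        while (i < input.size() && input[i] != ' ') {
            if (input[i] < '0' || input[i] > '9') digits = false;
            ++i;
        }
        out.push_back({digits ? TOK_NUMBER : TOK_IDENT, start, i});
    }
    return out;
}
```

Operators and constant keywords never need token_copy at all, since the
token number already identifies them.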

The C prototype does not use exceptions for error handling but return
values, which in turn leads to loads of if(ok) checks. If this works out
and performs well I'll copy that approach to C++ mode as well. I'm
currently working on gluing a prototype C Java lexer to the antlr2 parser
to get some comparison results against the current lexer.
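
A minimal sketch of what return-value error handling looks like, assuming
C++17. This is not generated prototype output: Lexer, match_char,
match_digits and match_paren_number are hypothetical names for a toy rule
'(' DIGITS ')'. It mainly shows where the if(ok) checks come from.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

struct Lexer {
    std::string_view input;
    std::size_t pos = 0;
    std::size_t error_pos = 0;  // position where the last failure was noticed
};

// Match one expected character; report failure by return value, not throw.
inline bool match_char(Lexer& lx, char expected) {
    if (lx.pos < lx.input.size() && lx.input[lx.pos] == expected) {
        ++lx.pos;
        return true;
    }
    lx.error_pos = lx.pos;
    return false;
}

// Match one or more decimal digits.
inline bool match_digits(Lexer& lx) {
    const std::size_t start = lx.pos;
    while (lx.pos < lx.input.size() &&
           lx.input[lx.pos] >= '0' && lx.input[lx.pos] <= '9') {
        ++lx.pos;
    }
    if (lx.pos == start) {
        lx.error_pos = lx.pos;
        return false;
    }
    return true;
}

// Rule: '(' DIGITS ')'. Every sub-match result has to be checked by hand.
inline bool match_paren_number(Lexer& lx) {
    assert(lx.pos <= lx.input.size());
    bool ok = match_char(lx, '(');
    if (ok) ok = match_digits(lx);
    if (ok) ok = match_char(lx, ')');
    return ok;
}
```

The cost is one branch per sub-rule on the success path, against no stack
unwinding on the failure path.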

For now the prototype generates self-contained lexers/parsers (i.e. no
support lib). It should also be fully reentrant.

The StringTemplate templates currently still impose a structure that:

1. reads the input and decides which way to go (done via simple nested
   checks or via a DFA built with gotos in the case of infinite prefixes);
2. goes that way and rereads the input (and actually redoes the check
   already done in step 1).
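
The two steps above can be sketched as follows, assuming C++17. This is a
hand-written illustration, not template output: Input, predict and
lex_number are hypothetical names, and the toy grammar only tells integers
from floats. The reads counter shows that the digits are inspected once
for the prediction and again when the chosen rule consumes them.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

enum class Alt { Int, Float, None };

struct Input {
    std::string_view text;
    std::size_t pos = 0;
    std::size_t reads = 0;  // counts every character inspection

    // Look at the character `offset` positions ahead; '\0' past the end.
    char la(std::size_t offset) {
        ++reads;
        const std::size_t i = pos + offset;
        return i < text.size() ? text[i] : '\0';
    }
};

inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Step 1: scan ahead without consuming to decide which alternative applies.
inline Alt predict(Input& in) {
    std::size_t k = 0;
    while (is_digit(in.la(k))) ++k;
    if (k == 0) return Alt::None;
    return in.la(k) == '.' ? Alt::Float : Alt::Int;
}

// Step 2 helper: consume digits, rereading what step 1 already looked at.
inline void consume_digits(Input& in) {
    while (is_digit(in.la(0))) ++in.pos;
}

inline Alt lex_number(Input& in) {
    const Alt alt = predict(in);
    if (alt == Alt::None) return alt;
    consume_digits(in);
    if (alt == Alt::Float) {
        // The '.' was already seen during prediction; it is checked again.
        assert(in.pos < in.text.size() && in.text[in.pos] == '.');
        ++in.pos;
        consume_digits(in);  // fraction digits (may be empty in this sketch)
    }
    return alt;
}
```

With long lookahead the prediction scan grows with the token length, which
is the case where the reread gets expensive.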

Yet the impact with the current prototype should be smaller than with the
lexers in antlr2 (and to some extent you can customize the templates to
inline more). Ter also mentioned that the most obvious of those
read-and-reread cases could be optimized. In the case of big lookahead for
a prediction the current setup could lead to thrashing.

A problem with the mmap approach, though, is that you can't parse files
that are ridiculously big (I think... not 100% sure there), and problems
may arise with OSes that do not have mmap (then again, you can read the
file directly into a buffer and index that).
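
A minimal sketch of that fallback, assuming C++17 and the POSIX calls
open, fstat, mmap and munmap where available. InputBuffer and the
SKETCH_HAVE_MMAP macro are hypothetical names, not part of any ANTLR
code. Either way the parser sees one contiguous buffer it can index.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <string_view>

#if defined(__unix__) || defined(__APPLE__)
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#define SKETCH_HAVE_MMAP 1
#else
#define SKETCH_HAVE_MMAP 0
#endif

class InputBuffer {
public:
    explicit InputBuffer(const std::string& path) {
#if SKETCH_HAVE_MMAP
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd >= 0) {
            struct stat st;
            // Zero-length mappings are rejected by mmap, so skip them.
            if (::fstat(fd, &st) == 0 && st.st_size > 0) {
                void* p = ::mmap(nullptr, static_cast<std::size_t>(st.st_size),
                                 PROT_READ, MAP_PRIVATE, fd, 0);
                if (p != MAP_FAILED) {
                    map_ = p;
                    size_ = static_cast<std::size_t>(st.st_size);
                }
            }
            ::close(fd);  // the mapping stays valid after close
        }
#endif
        if (!map_) {
            // No mmap, or mapping failed: read the whole file into memory.
            std::ifstream in(path, std::ios::binary);
            fallback_.assign(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
        }
    }

    ~InputBuffer() {
#if SKETCH_HAVE_MMAP
        if (map_) ::munmap(map_, size_);
#endif
    }

    InputBuffer(const InputBuffer&) = delete;
    InputBuffer& operator=(const InputBuffer&) = delete;

    // One contiguous view of the input, whichever way it was loaded.
    std::string_view view() const {
        return map_ ? std::string_view(static_cast<const char*>(map_), size_)
                    : std::string_view(fallback_);
    }

    // Index into the buffer, as the lexer would during parsing.
    char at(std::size_t i) const {
        assert(i < view().size());
        return view()[i];
    }

private:
    void* map_ = nullptr;
    std::size_t size_ = 0;
    std::string fallback_;
};
```

A mapping cannot exceed the available virtual address space, so very large
files are mostly a concern on 32-bit systems.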

With the ANTLR3 C++ support code I'd rather make it easy to plug in your
own custom string classes than supply another string library that needs to
be maintained (although people have offered custom string libraries). Also,
the Unicode story will crop up pretty quickly when we talk about string
classes.
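
One way such plugging in could look is a text policy passed as a template
parameter. This is only a sketch assuming C++17, not an ANTLR3 design:
CopyText, ViewText, NoText, BasicToken and make_token are hypothetical
names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Policy: the token owns a copy of its text.
struct CopyText {
    using type = std::string;
    static type make(std::string_view in, std::size_t start, std::size_t end) {
        return std::string(in.substr(start, end - start));
    }
};

// Policy: the token borrows its text from the input buffer (no copy).
struct ViewText {
    using type = std::string_view;
    static type make(std::string_view in, std::size_t start, std::size_t end) {
        return in.substr(start, end - start);
    }
};

// Policy: the token carries no text at all (operators, keywords).
struct NoText {
    struct type {};
    static type make(std::string_view, std::size_t, std::size_t) { return {}; }
};

// The token layout depends only on the chosen policy.
template <class TextPolicy>
struct BasicToken {
    int type;
    typename TextPolicy::type text;
};

template <class TextPolicy>
BasicToken<TextPolicy> make_token(int type, std::string_view in,
                                  std::size_t start, std::size_t end) {
    assert(start <= end && end <= in.size());
    return {type, TextPolicy::make(in, start, end)};
}
```

A user-supplied string class (a Unicode-aware one, say) would slot in as
one more policy with the same two members.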

I would prefer to have an even smaller core support lib than we have now
(and provide more of the advanced features via example code).

Cheers,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893755  ----
-----+++++*****************************************************+++++++++-------
  "Of all the things I've lost I miss my mind the most --- Ozzy Osbourne
