[antlr-interest] BitSet and big charVocabulary in C++

Fri Feb 16 06:32:45 PST 2007

Hi,

On 2/16/07, Vitaliy Akimov <vitaliy.akimov at gmail.com> wrote:
> Hi, I'm implementing a Unicode lexer using ANTLR v2.7 (for C++), and
> I've found an annoyance with patterns that are translated to a BitSet.
> Using a big vocabulary leads to a noticeable amount of time being spent
> on BitSet construction.

By reusing the Lexer/Parser objects and just refreshing the input buffer,
you can make sure this initialization is done only once for a set of
files.
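
A rough sketch of the reuse idea, in case it helps. SketchLexer and its
members are hypothetical stand-ins, not ANTLR's C++ API; the table built in
the constructor plays the role of the expensive BitSet construction, and
resetInput() plays the role of refreshing the input buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a generated lexer; the names are illustrative
// and are NOT ANTLR's C++ API.
class SketchLexer {
public:
    explicit SketchLexer(std::size_t vocabularySize)
        : letters_(vocabularySize, false), input_(), pos_(0) {
        assert(vocabularySize > static_cast<std::size_t>('z'));
        // Expensive one-time setup: stands in for BitSet construction.
        for (std::size_t c = 'a'; c <= 'z'; ++c) {
            letters_[c] = true;
        }
    }

    // Point the existing lexer at new input; the table is left untouched.
    void resetInput(const std::string& text) {
        input_ = text;
        pos_ = 0;
    }

    // Count how many characters of the current input are in the set.
    std::size_t countLetters() {
        std::size_t n = 0;
        for (; pos_ < input_.size(); ++pos_) {
            const std::size_t c = static_cast<unsigned char>(input_[pos_]);
            if (c < letters_.size() && letters_[c]) {
                ++n;
            }
        }
        return n;
    }

private:
    std::vector<bool> letters_;  // built once, in the constructor
    std::string input_;
    std::size_t pos_;
};
```

Constructing one SketchLexer and calling resetInput() per file pays the
setup cost once, however many files follow.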

> Raising codeGenBitsetTestThreshold produces a huge condition in the
> "if" statement (millions of symbols). Why doesn't ANTLR generate
> conditions with the not operator (!) for simple expressions
> such as "(~ ('a'| 'z'))" ?

I think Terence could shed more light on that. This is not decided in
the C++ codegen, although the BitSet class could be made smarter.
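
To make the question concrete, here is a minimal sketch of the kind of
negated test being asked for. It is hand-written, not ANTLR output;
matchesNotAOrZ is a hypothetical name, and the Unicode upper bound is an
assumption about the vocabulary.

```cpp
#include <cassert>

// Hypothetical helper, NOT generated by ANTLR: tests membership in
// ~('a'|'z') by negating the two-element set instead of enumerating
// every other character in the vocabulary.
inline bool matchesNotAOrZ(unsigned long c) {
    assert(c <= 0x10FFFFul);  // assumed vocabulary upper bound (Unicode)
    return !(c == 'a' || c == 'z');
}
```

The test costs two comparisons regardless of vocabulary size, whereas a
positive enumeration or a BitSet grows with the vocabulary.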

> And why does ANTLR copy the generated bitset from an array of longs
> to a vector<bool>, which is very time consuming?

Legacy: many of the C++ classes are direct ports of existing Java classes
and are not the optimal solution to the problem. Since most
parsers/lexers do not contain very large bitsets, this problem has not
become very apparent (yet). Unicode is more or less experimental in
C++ mode.

> I think it's more reasonable to use a reference to this packed array
> and unpack bits in the match function.

I agree. If you create a patch for the BitSet class to fix this,
I'll be happy to include it in the next release.
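
For reference, a minimal sketch of what such a change could look like. It
is not the actual BitSet class: PackedBitSetView is a hypothetical name,
and the 32-bit word size is an assumption about how the generated data is
packed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: the generated table packs 32 bits per word.
constexpr std::size_t kBitsPerWord = 32;

// Hypothetical non-owning view over a packed bit table; NOT part of the
// ANTLR runtime. It keeps a pointer to the generated array instead of
// copying it into a vector<bool>, and unpacks a bit only when asked.
class PackedBitSetView {
public:
    PackedBitSetView(const std::uint32_t* words, std::size_t wordCount)
        : words_(words), wordCount_(wordCount) {
        assert(words != nullptr || wordCount == 0);
    }

    // True if bit `index` is set; indices past the table count as unset.
    bool member(std::size_t index) const {
        const std::size_t word = index / kBitsPerWord;
        if (word >= wordCount_) {
            return false;
        }
        return ((words_[word] >> (index % kBitsPerWord)) & 1u) != 0;
    }

private:
    const std::uint32_t* words_;  // must outlive the view
    std::size_t wordCount_;
};
```

Construction is then constant-time whatever the vocabulary size, at the
cost of a shift and mask per lookup. The packed array has to stay alive
for as long as the view is used, which holds for static generated data.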

Cheers,

Ric