[antlr-interest] C++ and Unicode

Ric Klaren klaren at cs.utwente.nl
Mon Aug 16 02:57:50 PDT 2004


On Mon, Aug 16, 2004 at 12:36:16PM +0300, Ruslan Zasukhin wrote:
> Yes, we also think that UTF8 should be the first step into the Unicode world.
> And it looks to be a relatively easy step.

I got it going, and indeed it was easier than I expected. But there are
probably still a few 'sore' spots where wrong assumptions are made or the
code generator needs tweaking. Generating grammars with a big charVocab is
slow due to the bitset generation. The bitsets also currently take up a bit
too much space. I guess I can get a nice initial reduction by stripping
leading/trailing zeros from the bitsets.
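
To illustrate the kind of reduction meant by stripping leading/trailing
zeros, here is a minimal sketch. It is not the ANTLR BitSet code: the names
TrimmedBitSet, firstWord, member and trim are hypothetical, and it assumes
the full bitset is stored as a vector of 32-bit words. Only the words
between the first and last non-zero word are kept, together with the index
of the first kept word.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical trimmed bitset (illustration only, not ANTLR's BitSet).
// All-zero words at both ends are dropped; firstWord remembers the offset.
struct TrimmedBitSet {
    std::size_t firstWord = 0;         // index of words[0] in the full bitset
    std::vector<std::uint32_t> words;  // stored range, may be empty

    bool member(std::uint32_t ch) const {
        std::size_t idx = ch / 32;
        if (idx < firstWord || idx - firstWord >= words.size())
            return false;              // falls in a stripped (all-zero) region
        return ((words[idx - firstWord] >> (ch % 32)) & 1u) != 0;
    }
};

// Build a trimmed copy of a full bitset stored as 32-bit words.
TrimmedBitSet trim(const std::vector<std::uint32_t>& full) {
    std::size_t begin = 0;
    std::size_t end = full.size();
    while (begin < end && full[begin] == 0) ++begin;   // skip leading zero words
    while (end > begin && full[end - 1] == 0) --end;   // skip trailing zero words

    TrimmedBitSet out;
    out.firstWord = begin;
    out.words.assign(full.begin() + static_cast<std::ptrdiff_t>(begin),
                     full.begin() + static_cast<std::ptrdiff_t>(end));
    return out;
}
```

For a 16-bit charVocab a full set needs 2048 words, while a set that only
covers a small range of characters shrinks to a handful of words. Sets with
members at both ends of the range gain nothing from this approach.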

> For example for our SQL grammar, we have:
>
> A) keywords -- always English

I have not tested the literal table checks yet. But since this is all
contained in one or two methods, it will be easy to override/tweak.

> B) identifiers -- we want/can extract them as UTF8 strings, which we will
> later convert to UTF16 ourselves. An identifier is e.g. the name of a table
> or field.
>
> C) string constants --
>
>     fld = 'affjsdfhjkfhjksdhf '
>
> It can also be extracted as UTF8, and later we will convert it to UTF16.
>
> We need to convert to UTF16 because we use the IBM ICU library, so all our
> internal algorithms work in UTF16.

Currently I use a custom CharBuffer that decodes UTF8 to integers. Changing
it to do autodetection or UTF16 is easy. Inside the lexer
(UnicodeCharScanner base class) most checks are done on 32-bit int values.
Since antlr builds the strings that are passed to the parser character by
character, you only have to override one append method to do the encoding
for the backend. Just to have something there, I encode the 32-bit values
back to UTF8 and store them in a std::string. Storing them as UTF16 would be
near trivial, I guess ;)
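
As a rough sketch of the decode/append round trip described above (not the
actual CharBuffer or UnicodeCharScanner code, which is not shown in this
message): the function names decodeUtf8, appendUtf8 and appendUtf16 are
hypothetical, std::u16string assumes a C++11 compiler, and the decoder is
simplified in that it does not reject overlong forms or surrogate values.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one UTF-8 sequence starting at s[pos] (requires pos < s.size())
// and advance pos past it. Returns U+FFFD on malformed input.
std::uint32_t decodeUtf8(const std::string& s, std::size_t& pos) {
    unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80) return lead;                       // plain ASCII
    int extra;
    std::uint32_t cp;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;                                 // invalid lead byte
    for (; extra > 0; --extra) {
        if (pos >= s.size()) return 0xFFFD;             // truncated sequence
        unsigned char c = static_cast<unsigned char>(s[pos]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;          // not a continuation byte
        cp = (cp << 6) | (c & 0x3Fu);
        ++pos;
    }
    return cp;
}

// Append one code point to a UTF-8 std::string.
void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Append one code point to a UTF-16 string, using a surrogate pair
// for values above U+FFFF.
void appendUtf16(std::u16string& out, std::uint32_t cp) {
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
}
```

With this split, switching the backend from UTF8 to UTF16 only means
swapping the append function. The lexer side keeps working on 32-bit values
either way.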

I hope I can release this stuff soon so some people who actually use this
stuff can see what it does and what needs tweaking/fixing.

Cheers,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893755  ----
-----+++++*****************************************************+++++++++-------
     Innovation makes enemies of all those who prospered under the old
   regime, and only lukewarm support is forthcoming from those who would
               prosper under the new. --- Niccolò Machiavelli



 