[antlr-interest] C++ and Unicode

Martin Probst mail at martin-probst.com
Mon Aug 16 15:07:11 PDT 2004


Hello,
thanks for your help. I've already considered "emulating" a UTF-8-aware
lexer with the current ANTLR release, though it seems somewhat unclean
to me (even though it's not really that much more work, I guess).
I would love to try out any Unicode-aware releases. If you can supply
one, I would like to test it with an XQuery grammar.

Regards,
Martin Probst

On Mon, 16.08.2004 at 11:57, Ric Klaren wrote:
> On Mon, Aug 16, 2004 at 12:36:16PM +0300, Ruslan Zasukhin wrote:
> > Yes, we also think that UTF8 should be the first step into the Unicode
> > world. And it looks to be a relatively easy step.
> 
> I got it going, and indeed it was easier than I expected. But there are
> probably still a few 'sore' spots where wrong assumptions are made or the
> code generator needs tweaking. Generating grammars with a big charVocab is
> slow due to the bitset generation. Also, the bitsets currently take up a
> bit too much space. I guess I can get a nice initial reduction by stripping
> leading/trailing zeros from the bitsets.
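
To make the bitset trimming idea concrete, here is a minimal sketch of what
stripping leading/trailing zero words could look like. This is not ANTLR's
BitSet class; TrimmedBitSet, member() and storedWords() are hypothetical
names, and the layout (32 bits per word, bit 0 of word 0 is character 0) is
an assumption made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: keeps only the words between the first and last
// non-zero word of a full-size bitset, plus the index of the first kept word.
class TrimmedBitSet {
public:
    // 'words' is the untrimmed bitset, 32 bits per word (assumed layout).
    explicit TrimmedBitSet(const std::vector<std::uint32_t>& words) {
        std::size_t first = 0;
        std::size_t last = words.size();
        while (first < last && words[first] == 0) ++first;    // leading zero words
        while (last > first && words[last - 1] == 0) --last;  // trailing zero words
        offset_ = first;
        words_.assign(words.begin() + static_cast<std::ptrdiff_t>(first),
                      words.begin() + static_cast<std::ptrdiff_t>(last));
    }

    // True if character value 'c' is in the set; anything outside the
    // stored range is treated as absent.
    bool member(std::uint32_t c) const {
        const std::size_t idx = c / 32;
        if (idx < offset_ || idx - offset_ >= words_.size()) return false;
        return ((words_[idx - offset_] >> (c % 32)) & 1u) != 0;
    }

    // Number of 32-bit words actually stored after trimming.
    std::size_t storedWords() const { return words_.size(); }

private:
    std::size_t offset_ = 0;
    std::vector<std::uint32_t> words_;
};
```

With a 16-bit vocabulary the untrimmed set needs 2048 words; a set whose
members all fall in a narrow range shrinks to a handful of words at the cost
of one extra range check per lookup.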
> 
> > For example for our SQL grammar, we have:
> >
> > A) keywords -- always English
> 
> I have not tried the literal table testing yet. But since this is all
> contained in one or two methods, it will be easy to override/tweak.
> 
> > B) identifiers -- we want/can extract them as UTF8 strings, which we will
> > later convert to UTF16 ourselves. Identifiers are e.g. the name of a
> > table or field.
> >
> > C) string constants --
> >
> >     fld = 'affjsdfhjkfhjksdhf '
> >
> > It can also be extracted as UTF8, and later we will convert it to UTF16.
> >
> > We need to convert to UTF16 because we use the IBM ICU library, so all
> > our internal algorithms work in UTF16.
> 
> Currently I use a custom CharBuffer that decodes UTF8 to integers. Changing
> it to autodetect the encoding or to read UTF16 is easy. Inside the lexer
> (UnicodeCharScanner base class) most checks are done on 32-bit int values.
> Since antlr builds the strings that are passed to the parser character by
> character, you only have to override one append method to do the encoding
> for the backend. Just to have something there, I encode the 32-bit values
> back to UTF8 and store them in a std::string; storing them as UTF16 would
> be near trivial, I guess ;)
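
For readers following along, the decode/append round trip described above
can be sketched roughly as follows. This is not Ric's code and not part of
any ANTLR release; decodeUtf8, appendUtf8 and appendUtf16 are hypothetical
free functions standing in for the custom CharBuffer and the overridden
append method. The decoder is deliberately simplified: it does not reject
overlong forms or surrogate values, and it expects pos < in.size() on entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one code point from UTF-8 text starting at in[pos] and advance pos
// past the bytes consumed. Returns U+FFFD for malformed or truncated input.
inline std::uint32_t decodeUtf8(const std::string& in, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(in[pos++]);
    if (lead < 0x80) return lead;                        // single byte (ASCII)
    int extra = 0;
    std::uint32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1Fu; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0Fu; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07u; }
    else return 0xFFFD;                                  // invalid lead byte
    for (int i = 0; i < extra; ++i) {
        if (pos >= in.size()) return 0xFFFD;             // truncated sequence
        const unsigned char c = static_cast<unsigned char>(in[pos]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;           // not a continuation byte
        cp = (cp << 6) | (c & 0x3Fu);
        ++pos;
    }
    return cp;
}

// Append one code point to a UTF-8 encoded std::string.
inline void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Append one code point to a UTF-16 string, using a surrogate pair for
// values above U+FFFF.
inline void appendUtf16(std::u16string& out, std::uint32_t cp) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}
```

Because the lexer only ever sees 32-bit values, swapping the backend
encoding comes down to calling a different append function; the decoding
side stays untouched.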
> 
> I hope I can release this stuff soon so some people who actually use this
> stuff can see what it does and what needs tweaking/fixing.
> 
> Cheers,
> 
> Ric
> --
> -----+++++*****************************************************+++++++++-------
>     ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893755  ----
> -----+++++*****************************************************+++++++++-------
>      Innovation makes enemies of all those who prospered under the old
>    regime, and only lukewarm support is forthcoming from those who would
>                prosper under the new. --- Niccolò Machiavelli
> 




 