[antlr-interest] Unicode & C++ & ANTLR2 (and a bit 3)

Ric Klaren klaren at cs.utwente.nl
Tue Jul 6 06:56:41 PDT 2004


Hi,

Under the motto of understanding through programming I tried to fix up
ANTLR2 to be able to read UTF8 in C++ mode.

This actually turned out to be easier as I thought (well the groundwork at
least). Add one custom InputBuffer that translates UTF8 to unicode
codepoints (unsigned ints). Add a custom lexer superclass that after lexing
reencodes the codepoints to UTF8 and moves them off to the parser in the
'normal' std::string embedded in CommonToken. (With more extensive
tinkering this could also be a custom token class with wchar_t payload)

Some observations/conclusions:

- I'd definitely like to get int arrays or int vectors for the strings in
  codegen (with pure unicode codepoints).

ANTLR2 needs to support 32 bit escapes in the the lexer to support full
unicode. (currently can't specify values above \uFFFF could opt to
introduce a new escape syntax that support variable length hex values
\u{(HEXDIGIT)+} or something)

 - Question can ANTLR 2's analysis engine deal with such values?

Current bitset generation for unicode is quite expensive (parse times are
long because of bitset generation)

There's quite some interesting choices with respect to how to add the UTF
decoding encoding to the scanner classes.

- What implementations for wchar_t are common? Are all 32 bit based or are
  there some 16 bitters.
- How do these deal with encoding string/char constants defined with
  L"string" and L'c'.

I guess we could gain lexer speed by dropping support for the ! operator in
a lexer (or restrict it's use to the start and end of the token text, that
way we could probably only deal with indexes or pointers in stead of all
the copying that is happening now).

Cheers,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893755  ----
-----+++++*****************************************************+++++++++-------
 "Don't call me stupid." "Oh, right. To call you stupid would be an insult
    to stupid people. I've known sheep that could outwit you! I've worn
              dresses with higher IQs!" --- A Fish Called Wanda



 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/
 



More information about the antlr-interest mailing list