[antlr-interest] antlr 3 unicode support

Mon Nov 24 03:20:54 PST 2003

At 2003-11-22 15:56 +0100, Ric Klaren wrote:
>Hi,
>
>On Fri, Nov 21, 2003 at 01:06:14AM -0500, Tom Moog wrote:
> > Just to remind you java guys thinking about lexing unicode:
> > unicode doesn't stop at 2**16.  It extends up to 0x10ffff if you
> > want to include music symbols, Babylonian, math style letters,
> > and so on.  For java xml parsers this means using the low order
> > ten bits from two adjacent 16 bit words (surrogate pairs) to
> > reach things above 2**16.
>
>Hmmm this would mean that we would have to deal with unicode decoding
>ourselves in the lexer? And use 32 bit values for the tokens/strings.
>
>So far for C++ I was only looking at making the backend wchar/wstring
>aware, although extending up from there would not be hard with
>templates.

In Java you still need to handle Unicode yourself.  JDK 1.4 is aligned
with Unicode 3.0, which was the last 16-bit (BMP-only) version.  You
can get some extra mileage by treating the chars as UTF-16 code units.
That works if you pass surrogates through without decoding them and
assume that they are paired correctly.  A bit more effort is needed if
the lexer has to distinguish between characters in planes 1 and up,
since that means combining each pair into a code point first.
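
A minimal sketch of that extra effort, assuming the pair is combined by
hand because JDK 1.4 has no code point API.  The class and method names
are made up for illustration and are not part of ANTLR or the JDK:

```java
// Hypothetical helper for illustration only.
final class SurrogateDecoder {
    /** True if c is a high (leading) surrogate, U+D800..U+DBFF. */
    public static boolean isHigh(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    /** True if c is a low (trailing) surrogate, U+DC00..U+DFFF. */
    public static boolean isLow(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    /** Combine the low ten bits of each half into a code point >= 0x10000. */
    public static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    /**
     * Code point that starts at index i.  A BMP char or an unpaired
     * surrogate is returned unchanged as a single 16-bit value.
     */
    public static int codePointAt(String s, int i) {
        char c = s.charAt(i);
        if (isHigh(c) && i + 1 < s.length() && isLow(s.charAt(i + 1))) {
            return combine(c, s.charAt(i + 1));
        }
        return c;
    }
}
```

A lexer built on this would advance by two chars whenever codePointAt
returns a value of 0x10000 or more, and by one char otherwise.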

In C++ wchar_t (and hence wstring) may be 8, 16 or 32 bits wide
depending on the implementation.  To guarantee the width you'd need to
roll your own basic_string<int32_t> or use a library like ICU.  ICU4C
has similar constraints to Java: a UnicodeString holds two 16-bit
elements for each code point outside plane 0.  It is up to the
programmer to maintain correctness of the UTF-16.
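
Since the pairing rule is the same for Java chars and for ICU's 16-bit
elements, the check can be sketched once.  This is only an illustration
in Java with a made-up class name, not ICU's or ANTLR's API; it reports
the first element that breaks the pairing:

```java
// Hypothetical helper for illustration only.
final class Utf16Check {
    /** Index of the first unpaired surrogate, or -1 if all are paired. */
    public static int firstUnpaired(char[] units) {
        for (int i = 0; i < units.length; i++) {
            char c = units[i];
            if (c >= 0xD800 && c <= 0xDBFF) {
                // A high surrogate must be followed by a low surrogate.
                boolean paired = i + 1 < units.length
                        && units[i + 1] >= 0xDC00
                        && units[i + 1] <= 0xDFFF;
                if (!paired) {
                    return i;
                }
                i++; // skip the low half that was just checked
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                // A low surrogate with no high surrogate before it.
                return i;
            }
        }
        return -1;
    }
}
```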

-- 
Pete Forman                -./\.-  Disclaimer: This post is originated
WesternGeco                  -./\.-   by myself and does not represent
pete.forman at westerngeco.com    -./\.-   opinion of Schlumberger, Baker
http://petef.port5.com          -./\.-   Hughes or their divisions.
