[antlr-interest] unicode 16bit versus new 21bit stuff
pete.forman at westerngeco.com
Mon Jun 21 04:44:26 PDT 2004
At 2004-06-18 17:49 -0700, Terence Parr wrote:
>I thought I was going to be able to get away with 16bit unicode values
>as Java seems to encode the "supplemental" crud via UTF-16 in char
>arrays / strings. But, now I see in Character that they are adding
>methods with int not char arguments to handle the beyond 16bit stuff.
The 32-bit int stuff is for low levels only. At higher levels,
including character and string literals in the language specification,
the encoding is UTF-16.
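For what it's worth, the int-based methods being added to Character and
String make the surrogate handling fairly painless. A minimal sketch
(the method names are the ones appearing in the 1.5 Character/String
APIs):

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) lies beyond the BMP,
        // so a Java String stores it as a UTF-16 surrogate pair.
        String s = new String(Character.toChars(0x1D49C));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d49c
    }
}
```

So the String stays UTF-16 internally, but you read and write whole
code points through the int-taking methods.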
Consider other languages as well. C# is restricted to 16-bit
characters. wchar_t in C++ is too vaguely specified; in practice it
leaves you free to work in a width of your choosing.
>My analysis algorithms use pure int so there is no trouble with that,
>however, I do encode token types in the upper 16 bits of a 32-bit int
>and have all chars in the lower 16 bits. This is purely programming
>convenience as I know how to print out a token type by its value
>range. I don't want to go to 64-bit ints as most CPUs are still 32
>bits natively. If I use 21-bit Unicode values, that leaves 2^11 or 2048
>token type values, which makes me a bit nervous.
Future Unicode versions may eat away at those 11 bits. Also someone
might want to work with ISO 10646-1 which uses 31 bits.
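To make the arithmetic concrete, here is a sketch of the packing scheme
Terence describes (the names pack/tokenType/codePoint are mine, purely
illustrative):

```java
public class TokenPacking {
    // Illustrative packing: token type in the high bits, code point
    // in the low 21 bits of one 32-bit int.
    static final int CODE_POINT_BITS = 21;
    static final int MAX_TOKEN_TYPES = 1 << (32 - CODE_POINT_BITS); // 2048

    static int pack(int tokenType, int codePoint) {
        return (tokenType << CODE_POINT_BITS) | codePoint;
    }

    static int tokenType(int packed) {
        return packed >>> CODE_POINT_BITS;
    }

    static int codePoint(int packed) {
        return packed & ((1 << CODE_POINT_BITS) - 1);
    }

    public static void main(String[] args) {
        int packed = pack(5, 0x10FFFF); // highest Unicode code point
        System.out.println(tokenType(packed));                    // 5
        System.out.println(Integer.toHexString(codePoint(packed))); // 10ffff
        System.out.println(MAX_TOKEN_TYPES);                      // 2048
    }
}
```

With 16-bit chars the same trick left a comfortable 65536 token types;
at 21 bits it drops to 2048, which is the source of the nervousness.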
>I want to do unicode "right" this time. Anybody have a strong opinion
>about the new supplemental (beyond 16bit unicode) char values and/or
>whether 2048 is a serious token type limitation?
Unicode went to 21 bits three years ago. Java is only just catching
up. There is a new version of C# due out soon which may follow suit.
Limiting code points to 16 bits is definitely out, and making
programmers code surrogates explicitly is not desirable.
>The new system will be cool. You'll be able to use
>Character.UnicodeBlock stuff such as vocabulary=BENGALI;
That sounds good but consider what should be done about code points
that are undefined. They may get added by a later version of Unicode.
What may be an error at build time might become legal later.
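A sketch of both checks with the int-taking Character methods. The
block test is the kind of thing vocabulary=BENGALI would compile into;
for the undefined case I deliberately use a noncharacter, which stays
unassigned in every Unicode version, unlike an ordinary unassigned code
point that a later version may fill in:

```java
public class DefinedDemo {
    public static void main(String[] args) {
        int bengali = 0x0985; // BENGALI LETTER A
        System.out.println(Character.UnicodeBlock.of(bengali)
                == Character.UnicodeBlock.BENGALI);        // true
        // U+10FFFE is a noncharacter: permanently undefined.
        System.out.println(Character.isDefined(0x10FFFE)); // false
    }
}
```

The awkward cases are the merely-unassigned code points, where
Character.isDefined gives a different answer depending on which Unicode
version the runtime's tables were built from.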
I would like to see a distinction made between the encodings used for
1) writing grammars (.g files)
2) the generated code
3) the input stream
It should be possible to use a mathematical symbol from the SMP within
a string in the grammar and have it show as a single glyph in your
editor. The generated code would need to use two characters to hold
it (unless a type wider than char was used).
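For example, the generated code could carry the SMP character as a pair
of \u escapes, even though the grammar file shows a single glyph:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // U+1D465 MATHEMATICAL ITALIC SMALL X, written in generated
        // code as its surrogate pair, since \u escapes only cover
        // 16 bits each.
        String x = "\uD835\uDC65";
        System.out.println(x.length());                  // 2 chars
        System.out.println(x.codePointAt(0) == 0x1D465); // true
    }
}
```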
The most natural encoding for the input stream is UTF-16 in
Java or C#. I wonder whether there would be any mileage in providing
options to generate code that works on UTF-8 or UTF-32. Another way of
looking at that is whether the lexer is being fed by a ByteScanner,
CharScanner or CodePointScanner. The first two require that the
lexer grok surrogates, and the first also requires UTF-8 decoding.
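A CodePointScanner of that sort might amount to little more than this
(CodePointScanner is a name I am inventing here; the merging of
surrogate pairs out of UTF-16 is the point):

```java
public class CodePointScanner {
    // Decode a UTF-16 char sequence into the int code points a
    // 21-bit-aware lexer would consume, merging surrogate pairs.
    static int[] scan(String utf16) {
        int[] out = new int[utf16.codePointCount(0, utf16.length())];
        int i = 0;
        for (int k = 0; k < utf16.length(); ) {
            int cp = utf16.codePointAt(k);
            out[i++] = cp;
            k += Character.charCount(cp); // 1 for BMP, 2 for SMP
        }
        return out;
    }

    public static void main(String[] args) {
        int[] cps = scan("a\uD835\uDC65b"); // 'a', U+1D465, 'b'
        System.out.println(cps.length);                  // 3
        System.out.println(Integer.toHexString(cps[1])); // 1d465
    }
}
```

With that in front of it, the lexer itself never sees a surrogate.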
I'd stop there on encodings. Others such as ISO 8859-* and Shift JIS
are better left to other modules to translate to Unicode.
Another Unicode issue that has not been raised is normalization, and
the four forms to choose from (NFC, NFD, NFKC and NFKD). An input may
use one character or
several to represent an accented letter. We could for example choose
to use NFC in the generated code and arrange for the input stream to
be normalized in the same way.
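The effect NFC would have, sketched with java.text.Normalizer (which is
newer than the rest of this discussion, but illustrates the point):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        // Decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        String decomposed = "e\u0301";
        // NFC composes the pair into the single character U+00E9.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(decomposed.length());       // 2
        System.out.println(composed.length());         // 1
        System.out.println(composed.equals("\u00E9")); // true
    }
}
```

If both the generated matching code and the input stream are run
through the same form, a literal like "é" in the grammar matches either
spelling of the accented letter in the input.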
Both the encoding and the normalization make it tricky to talk about
column numbers in the input. In general there will be an index into
the raw bytes or chars, and another for the decoded/normalized code
points.
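Concretely, the two indices diverge as soon as an SMP character appears
on the line:

```java
public class ColumnDemo {
    public static void main(String[] args) {
        String line = "\uD835\uDC65 = 1"; // SMP 'x', then " = 1"
        int rawIndex = line.indexOf('=');              // UTF-16 char index
        int column = line.codePointCount(0, rawIndex); // code-point column
        System.out.println(rawIndex); // 3
        System.out.println(column);   // 2
    }
}
```

An error message should presumably report the code-point column, while
any editor integration may want the raw index back again.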
Pete Forman -./\.- Disclaimer: This post is originated
WesternGeco -./\.- by myself and does not represent
pete.forman at westerngeco.com -./\.- opinion of Schlumberger, Baker
http://petef.port5.com -./\.- Hughes or their divisions.