[antlr-interest] unicode strings using supplemental char range
Terence Parr
parrt at cs.usfca.edu
Sun Jun 20 11:52:36 PDT 2004
Ok, another question about Unicode and strings. With 16-bit Unicode
and 16-bit chars in Java, strings are easy to interpret: "abc\u0FF0def"
is 7 chars long with one non-Latin char, '\u0FF0'. When you walk an
index down the string, each index is exactly one char.
Now on to Java 1.5 and this fun Unicode stuff where we have to do
UTF-16 in strings. According to the Java API docs for Character:
"The Java 2 platform uses the UTF-16 representation in char arrays
and in the String and StringBuffer classes. In this representation,
supplementary characters are represented as a pair of char values, the
first from the high-surrogates range, (\uD800-\uDBFF), the second from
the low-surrogates range (\uDC00-\uDFFF)."
One could make a string like "abc\uD800\uDFF0def", but that is
ambiguous. It could be 7 or 8 characters depending on how you
interpret the string: 8 if every 16-bit char is its own character, 7 if
the two-char sequence \uD800\uDFF0 is the UTF-16 representation of a
single supplementary char.
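To make the two readings concrete, here is a minimal sketch, assuming the
code point methods that Java 1.5 adds to String (codePointCount and
codePointAt). The class and method names are made up for illustration.

```java
public class SurrogateLength {
    // Number of 16-bit char units, which is what String.length() reports.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points: a valid high+low surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "abc\uD800\uDFF0def";
        System.out.println(charUnits(s));                          // prints 8
        System.out.println(codePoints(s));                         // prints 7
        // The pair at index 3 decodes to the supplementary code point U+103F0.
        System.out.println(Integer.toHexString(s.codePointAt(3))); // prints 103f0
    }
}
```

Both answers come from the same String object; which one is "the length"
depends on whether the caller thinks in chars or in code points.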
Given this ambiguity, I have a problem when building a lexer. If I see
the above string, I normally need to match each character against the
input. I have the input character as a full int, and the Reader will
take care of pulling stuff off the disk properly, but how should I
interpret the string to do the matching? Right now I walk an
index from 0..n-1 down the String asking for each 16-bit char.
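One possible way to do the walk per code point instead of per char is
sketched below. It assumes the literal is meant as UTF-16 and that the input
is already available as an int[] of code points; the class, the method, and
that input representation are hypothetical, not how ANTLR does it.
String.codePointAt and Character.charCount are the Java 1.5 methods.

```java
public class CodePointMatcher {
    // Returns true if the code points of 'literal' appear in 'input'
    // starting at position 'start'. 'input' holds one code point per element.
    static boolean matches(String literal, int[] input, int start) {
        int p = start;
        for (int i = 0; i < literal.length(); ) {
            // Decodes a surrogate pair at i into one code point;
            // an unpaired surrogate is returned as-is.
            int cp = literal.codePointAt(i);
            if (p >= input.length || input[p] != cp) {
                return false;
            }
            p++;
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With this walk, "abc\uD800\uDFF0def" consumes 7 input positions, the fourth
being 0x103F0, while a plain 0..n-1 char walk would try to consume 8.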
Please tell me we don't have to have a mechanism for people to specify
the format of their strings in their grammars!!! Ugh. We are right
back to the 1980s, when in C we had to do our own UTF-8 or whatever
interpretations. Java will now do just about nothing for us, as we have
the same exact situation, albeit with 16 bits instead of 8. Booooooo!
Can anybody correct my misunderstanding? If I'm understanding this
correctly, can anybody suggest how to deal with UTF-16 in strings? Are
strings now int[]? Ick. Right back to C.
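For what the int[] option might look like, here is a rough sketch that
expands a String into one int per code point and back. The class and method
names are invented for this example; it relies only on String.codePointCount,
String.codePointAt, Character.charCount, and the String(int[], int, int)
constructor, all of which are new in Java 1.5.

```java
public class CodePointArray {
    // Expands a UTF-16 String into one int per code point.
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 1 or 2 chars consumed
        }
        return out;
    }

    // Re-encodes code points as a UTF-16 String.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }
}
```

The conversion could be done once per literal, so that the matching loop
indexes an int[] where each index is exactly one character again.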
Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing