[antlr-interest] unicode strings using supplemental char range

Terence Parr parrt at cs.usfca.edu
Sun Jun 20 11:52:36 PDT 2004


Ok, another question about Unicode and strings.  With 16-bit Unicode 
and 16-bit chars in Java, strings are easy to interpret: "abc\u0FF0def" 
is 7 chars long with one non-Latin char, '\u0FF0'.  When you walk an 
index down the string, each index is exactly one character.

Now on to Java 1.5 and this fun Unicode stuff where we have to do 
UTF-16 in strings.  According to the Java API docs for Character:

   "The Java 2 platform uses the UTF-16 representation in char arrays 
and in the String and StringBuffer classes.  In this representation, 
supplementary characters are represented as a pair of char values, the 
first from the high-surrogates range, (\uD800-\uDBFF), the second from 
the low-surrogates range (\uDC00-\uDFFF)."

One could make a string like "abc\uD800\uDFF0def", but that is 
ambiguous.  It could be 7 or 8 characters depending on how you 
interpret the string: 8 if each 16-bit char counts separately, 7 if 
the two-char sequence \uD800\uDFF0 (a high surrogate followed by a low 
surrogate) is the UTF-16 representation of 1 supplementary character.
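
To make the two readings concrete, here is a minimal sketch, assuming the 
code point methods that Java 1.5 adds to String (codePointCount and 
codePointAt).  The class and method names are made up for illustration.

```java
// Hypothetical helper class: compares the two ways of counting the example string.
final class SurrogateCount {
    // The example string: a high surrogate (U+D800) followed by a low surrogate (U+DFF0).
    static final String SAMPLE = "abc\uD800\uDFF0def";

    // Number of 16-bit char values, which is what String.length() reports.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("chars=" + charCount(SAMPLE));            // prints chars=8
        System.out.println("codePoints=" + codePointCount(SAMPLE));  // prints codePoints=7
        // The pair at index 3 decodes to one supplementary code point.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(3))); // prints 103f0
    }
}
```

Under the UTF-16 reading the pair decodes as 0x10000 + (0xD800 - 0xD800) * 0x400 
+ (0xDFF0 - 0xDC00) = 0x103F0, which is why the code point count is 7.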

Given this ambiguity, I have a problem when building a lexer.  If I see 
the above string, I normally need to match each character against the 
input.  I have the input character as a full int, and the Reader will 
take care of pulling stuff off the disk properly, but how should I 
interpret the string to do the matching?  Right now I walk an index 
from 0..n-1 down the String asking for each 16-bit char.
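
One possible way to do the matching, sketched under the assumption that 
the literal is meant as UTF-16 and that the input arrives as one int per 
code point: step through the literal with String.codePointAt and advance 
by Character.charCount instead of by 1.  CodePointMatcher and match are 
hypothetical names, not ANTLR code.

```java
// Hypothetical sketch: match a grammar literal against input held as int code points.
final class CodePointMatcher {
    /**
     * Compares the literal, read as UTF-16, with input[start..].
     * Returns the number of input code points consumed, or -1 on mismatch.
     */
    static int match(String literal, int[] input, int start) {
        int pos = start;
        for (int i = 0; i < literal.length(); ) {
            // Decodes a surrogate pair into one code point; any other char is returned as-is.
            int cp = literal.codePointAt(i);
            if (pos >= input.length || input[pos] != cp) {
                return -1;
            }
            pos++;
            // Advance 2 chars past a surrogate pair, 1 char otherwise.
            i += Character.charCount(cp);
        }
        return pos - start;
    }

    public static void main(String[] args) {
        int[] input = {'a', 'b', 'c', 0x103F0, 'd', 'e', 'f'};
        System.out.println(match("abc\uD800\uDFF0def", input, 0)); // prints 7
    }
}
```

With this reading the index into the literal and the index into the input 
no longer move in lockstep, so the loop keeps them as two separate counters.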

Please tell me we don't have to have a mechanism for people to specify 
the format of their strings in their grammars!!! Ugh.  We are right 
back in the 1980s, when in C we had to do our own UTF-8 or whatever 
interpretations.  Java will now do just about nothing for us, as we 
have the same exact situation, albeit with 16 bits not 8.  Booooooo!

Can anybody correct my misunderstanding?  If I'm understanding this 
correctly, can anybody suggest how to deal with UTF-16 in strings?  Are 
strings now int[]?  Ick.  Right back to C.
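
On the int[] question, one option (a sketch only, with hypothetical names) 
is to decode a String into an int[] of code points once, so that later 
indexing is one element per character again.  It assumes the Java 1.5 
methods String.codePointCount, String.codePointAt and Character.charCount.

```java
// Hypothetical sketch: decode a UTF-16 String into an array of code points.
final class CodePoints {
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return out;
    }

    public static void main(String[] args) {
        int[] cps = toCodePoints("abc\uD800\uDFF0def");
        System.out.println(cps.length);                  // prints 7
        System.out.println(Integer.toHexString(cps[3])); // prints 103f0
    }
}
```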

Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing





 