[antlr-interest] unicode strings using supplemental char range

Mark Lentczner markl at glyphic.com
Thu Jun 24 15:20:46 PDT 2004


[ Sorry this is a few days behind; I've been involved in horrible IETF 
land... ]

> One could make a string like "abc\uD800\uDFF0def", but that is
> ambiguous.  It could easily be 7 or 8 characters depending on how you
> interpret the string.  The two char unicode sequence could actually be
> UTF-16 representation for 1 char.
No, this isn't ambiguous.  It is 7 characters long.  The two-char (Java 
type) sequence is two 16-bit units of the UTF-16 encoding of 1 code 
point (Unicode character).
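
To make that concrete, here is a tiny sketch - it assumes the 1.5-style 
String.codePointCount() method discussed below, so treat it as 
illustration only:

    String s = "abc\uD800\uDFF0def";
    System.out.println(s.length());                       // 8 chars (UTF-16 units)
    System.out.println(s.codePointCount(0, s.length()));  // 7 Unicode code points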

> Right now I walk an index from 0..n-1 down the String asking for each 
> 16-bit char.
You can't do that anymore.  I believe that Java 1.5 will have alternate 
indexing operations:

1) The char (Java type) version takes an index from 0..c-1 and returns 
   a char (Java type), which is really a 16-bit unit of UTF-16.
2) The int (Java type) version takes an index from 0..n-1 (where 
   n <= c) and returns an int (Java type), which is a Unicode code 
   point.
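
One way this might look in practice - a sketch that assumes a 1.5-style 
codePointAt(charIndex) plus Character.charCount(), so take the exact 
method names with a grain of salt:

    String s = "abc\uD800\uDFF0def";
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);      // full 21-bit code point
        // ... hand cp to the lexer ...
        i += Character.charCount(cp);   // advance by 1 or 2 chars
    }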

If you could rely on having Java 1.5, then you could simply use the int 
versions of the String methods.  Since you probably can't, you'll have 
to write your own string wrapper.
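
A rough sketch of what such a wrapper could look like on pre-1.5 Java, 
pairing surrogates by hand - the class name is made up, not anything 
in ANTLR:

    // Hypothetical helper: walks a String and hands back 21-bit code
    // points, combining surrogate pairs by hand (no 1.5 APIs needed).
    public final class CodePointReader {
        private final String s;
        private int i = 0;

        public CodePointReader(String s) { this.s = s; }

        public boolean hasNext() { return i < s.length(); }

        public int next() {
            char hi = s.charAt(i++);
            if (hi >= 0xD800 && hi <= 0xDBFF && i < s.length()) {
                char lo = s.charAt(i);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    i++;
                    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
                }
            }
            return hi;  // BMP char; an unpaired surrogate just falls through
        }
    }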

> Please tell me we don't have to have a mechanism for people to specify
> the format of their strings in their grammars!!!
No, I think 3.0 should state unequivocally that String (Java type) 
objects used for input are in accordance with the Java 1.5 
understanding - that is, they are char (Java type) arrays that contain 
UTF-16 16-bit units.  Since the surrogate range has never been legal 
as Unicode characters, there is no backward-compatibility problem with 
existing Strings.

The only problem you will have is people who wrote grammars knowing 
that their input was UTF-16, and knowing that Antlr scanned chars (Java 
type), not characters (Unicode) - these folks will break if they 
encoded lexer or parser rules with things like "\uD800\uDFF0" in them.  
Oh well...

> Java will now do just about nothing for us as we have
> the same exact situation, albeit with 16bits not 8.  Booooooo!
Yup.

> Actually, I just had an idea.  First, thanks to your help, I know that
> UTF-16 encoded in a string is unambiguously UTF-16.  Now, the only
> question is, how do we match a 21-bit char against it?  What if we just
> specified that the input must be UTF-16 also?  Then, ANTLR can pretend
> everything is 16 bits, right?
Well, as you pointed out, this is like my hack of lexing UTF-8 for my 
parsers in C++.  The operative word is HACK.  The other problem is that 
this will fall apart as soon as you want to put in the other cool 
Unicode class-based checks (isIdentifierStart, isLowerCase, etc...).
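
A concrete illustration of why, assuming the code-point (int) overloads 
that 1.5 is adding to Character:

    // U+10400 (DESERET CAPITAL LETTER LONG I) is a letter, but neither
    // of its surrogate halves is, so char-at-a-time checks get it wrong.
    System.out.println(Character.isLetter(0x10400));   // true  (code point)
    System.out.println(Character.isLetter('\uD801'));  // false (high surrogate)
    System.out.println(Character.isLetter('\uDC00'));  // false (low surrogate)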

Sorry Terence, suck it up and change all Strings to UnicodeArray, which 
is a class wrapper around int[].  Better yet, make it a protocol, and 
then supply implementations that scan over String, over int[], and 
perhaps over UTF-8-encoded byte[]...
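
One hypothetical shape for that protocol - the names are invented for 
illustration, not anything in ANTLR:

    // A minimal code-point source plus an int[]-backed implementation;
    // a String-backed one could decode up front with something like the
    // CodePointReader sketch above.
    interface CodePointSource {
        int size();                  // length in code points
        int codePointAt(int index);  // index in 0..size()-1
    }

    final class IntArraySource implements CodePointSource {
        private final int[] cps;
        IntArraySource(int[] cps) { this.cps = cps; }
        public int size() { return cps.length; }
        public int codePointAt(int index) { return cps[index]; }
    }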

	- Mark


Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/



 