[antlr-interest] unicode strings using supplemental char range
Mark Lentczner
markl at glyphic.com
Thu Jun 24 15:20:46 PDT 2004
[ Sorry this is a few days behind, I've been involved in horrible IETF
land... ]
> One could make a string like "abc\uD800\uDFF0def", but that is
> ambiguous. It could easily be 7 or 8 characters depending on how you
> interpret the string. The two char unicode sequence could actually be
> UTF-16 representation for 1 char.
No, this isn't ambiguous. It is 7 characters long. The two-char (Java
type) sequence is the pair of 16-bit units that UTF-16 uses to encode
1 code point (Unicode character).
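
To make the count concrete, here is a minimal sketch that decodes the pair by hand, so it does not depend on any Java 1.5 API. The class and method names (SurrogatePairDemo, combine, countCodePoints) are invented for illustration and are not part of ANTLR.

```java
class SurrogatePairDemo {
    // Combine a high surrogate (D800..DBFF) and a low surrogate (DC00..DFFF)
    // into the single code point they encode.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Count code points by walking the 16-bit units and treating each
    // valid high+low pair as one character.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++; // skip the low half of the pair
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "abc\uD800\uDFF0def";
        System.out.println(s.length());          // 8 sixteen-bit units
        System.out.println(countCodePoints(s));  // 7 code points
        System.out.println(Integer.toHexString(combine('\uD800', '\uDFF0'))); // 103f0
    }
}
```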
> Right now I walk an index from 0..n-1 down the String asking for each
> 16-bit char.
You can't do that anymore. I believe that Java 1.5 will have alternate
indexing operations: 1) The char (Java type) version takes an index
from 0..c-1 and returns a char (Java type) which is really a 16-bit
unit of UTF-16. 2) The int (Java type) version takes an index from
0..n-1 (where n <= c) and returns an int (Java type) which is a Unicode
code-point.
If you could rely on having Java 1.5, then you could simply use the int
versions of the String methods. Since you probably can't, you'll have
to write your own string wrapper.
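
For reference, a sketch of the "int version" using the methods that shipped in Java 5. One caveat: String.codePointAt(int) takes an index into the 16-bit units rather than a code-point index, so the loop advances by Character.charCount(cp). The class name CodePointWalk is hypothetical.

```java
class CodePointWalk {
    // Decode a String into an array of Unicode code points.
    static int[] toCodePoints(String s) {
        int n = s.codePointCount(0, s.length()); // number of code points
        int[] out = new int[n];
        int i = 0; // index into the 16-bit units
        int k = 0; // index into the output array
        while (i < s.length()) {
            int cp = s.codePointAt(i);     // reads one or two chars
            out[k++] = cp;
            i += Character.charCount(cp);  // advance by 1 or 2 units
        }
        return out;
    }

    public static void main(String[] args) {
        int[] cps = toCodePoints("abc\uD800\uDFF0def");
        System.out.println(cps.length);                  // 7
        System.out.println(Integer.toHexString(cps[3])); // 103f0
    }
}
```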
> Please tell me we don't have to have a mechanism for people to specify
> the format of their strings in their grammars!!!
No, I think 3.0 should state unequivocally that String (Java type)
objects used for input are in accordance with the Java 1.5
understanding - that is, they are char (Java type) arrays that contain
UTF-16 16-bit units. Since the surrogate pair area has always been
illegal as Unicode characters, there is no backward compatibility
problem with other Strings.
The only problem you will have is people who wrote grammars knowing
that their input was UTF-16, and knowing that Antlr scanned chars (Java
type) not characters (Unicode) - these folks will break if they encoded
lexer or parser rules with things like "\uD800\uDFF0" in them. Oh
well...
> Java will now do just about nothing for us as we have
> the same exact situation, albeit with 16bits not 8. Booooooo!
Yup.
> Actually, I just had an idea. First, thanks to your help, I know that
> UTF-16 encoded in a string is unambiguously UTF-16. Now, the only
> question is, how do we match a 21-bit char against it? What if we just
> specified that the input must be UTF-16 also? Then, ANTLR can pretend
> everything is 16 bits, right?
Well, as you pointed out, this is like my hack of lexing UTF-8 for my
parsers in C++. The operative word is HACK. The other problem is that
this will fall apart as soon as you want to put in the other cool
Unicode class-based checks (isIdentifierStart, isLowerCase, etc...).
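
A small sketch of why the 16-bit view breaks class checks, assuming Java 5 or later for the int overload of Character.isLetter. U+10400 (DESERET CAPITAL LETTER LONG I) is a letter, but neither of its surrogate halves is. The class and method names are made up for this example.

```java
class SurrogateClassCheck {
    // Tests each 16-bit unit on its own, as a char-at-a-time lexer would.
    static boolean allLettersByChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Tests whole code points, folding surrogate pairs first.
    static boolean allLettersByCodePoint(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String deseret = "\uD801\uDC00"; // U+10400 as a surrogate pair
        System.out.println(allLettersByChar(deseret));      // false
        System.out.println(allLettersByCodePoint(deseret)); // true
    }
}
```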
Sorry Terence, suck it up and change all Strings to UnicodeArray which
is a class wrapper around int[]. Better yet, make it a protocol, and
then supply implementations that scan over String, over int[], and
perhaps over UTF-8 encoded byte[]...
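
One way the protocol idea might look, as a rough sketch only. The names CodePointSource, StringSource, IntArraySource and SourceDemo are hypothetical and not anything in ANTLR, and the UTF-8 byte[] implementation is left out for brevity. The code avoids Java 1.5 features so it would also compile on older JDKs.

```java
// Hypothetical protocol: a forward scanner that yields Unicode code points.
interface CodePointSource {
    boolean hasNext();
    int next(); // one code point per call
}

// Scans a Java String, folding each valid surrogate pair into one code point.
class StringSource implements CodePointSource {
    private final String s;
    private int pos = 0;

    StringSource(String s) { this.s = s; }

    public boolean hasNext() { return pos < s.length(); }

    public int next() {
        char c = s.charAt(pos++);
        if (c >= 0xD800 && c <= 0xDBFF && pos < s.length()) {
            char low = s.charAt(pos);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                pos++;
                return 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            }
        }
        return c; // BMP character, or an unpaired surrogate passed through
    }
}

// Scans an int[] that already holds one code point per element.
class IntArraySource implements CodePointSource {
    private final int[] cps;
    private int pos = 0;

    IntArraySource(int[] cps) { this.cps = cps; }

    public boolean hasNext() { return pos < cps.length; }

    public int next() { return cps[pos++]; }
}

class SourceDemo {
    // A consumer written against the protocol works with either source.
    static int count(CodePointSource src) {
        int n = 0;
        while (src.hasNext()) {
            src.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(new StringSource("abc\uD800\uDFF0def")));        // 7
        System.out.println(count(new IntArraySource(new int[] {0x61, 0x103F0}))); // 2
    }
}
```

A lexer written against such an interface never sees 16-bit units, so rule matching and character-class checks can operate on whole code points regardless of how the input is stored.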
- Mark
Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/