[antlr-interest] unicode 16bit versus new 21bit stuff
Terence Parr
parrt at cs.usfca.edu
Fri Jun 18 17:49:57 PDT 2004
Gang,
I thought I was going to be able to get away with 16-bit Unicode values,
as Java seems to encode the "supplemental" crud via UTF-16 in char
arrays / strings. But now I see that in Character they are adding
methods with int rather than char arguments to handle values beyond 16 bits.
My analysis algorithms use pure int, so there is no trouble with that.
However, I do encode token types in the upper 16 bits of a 32-bit int and
keep all chars in the lower 16 bits. This is purely a programming
convenience, as I know how to print out a token type by its value
range. I don't want to go to 64-bit ints, as most CPUs are still 32 bits
natively. If I use 21-bit Unicode values, that leaves 11 bits, so 2^11 or
2048 token type values, which makes me a bit nervous.
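To make the bit budget concrete, here is a rough sketch of how a 21-bit char field and an 11-bit token type field could share one 32-bit int. The class PackedLabel and all of its method names are made up for illustration and are not ANTLR code. The sketch also assumes token type 0 is reserved so that a plain char and a token type can be told apart by value range, which leaves 2047 usable types rather than the full 2048.

```java
// Hypothetical sketch, not ANTLR source: one 32-bit int holds either a
// Unicode code point (low 21 bits) or a token type (high 11 bits).
final class PackedLabel {
    static final int CHAR_BITS = 21;                          // covers U+0000..U+10FFFF
    static final int CHAR_MASK = (1 << CHAR_BITS) - 1;        // 0x1FFFFF
    static final int MAX_TOKEN_TYPES = 1 << (32 - CHAR_BITS); // 2^11 = 2048

    /** Stores a token type in the upper 11 bits. Type 0 is reserved (assumption). */
    static int fromTokenType(int tokenType) {
        if (tokenType < 1 || tokenType >= MAX_TOKEN_TYPES) {
            throw new IllegalArgumentException("token type out of range: " + tokenType);
        }
        return tokenType << CHAR_BITS;
    }

    /** Stores a code point in the lower 21 bits; the upper bits stay zero. */
    static int fromCodePoint(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint;
    }

    /** A label is a token type exactly when any of the upper 11 bits is set. */
    static boolean isTokenType(int label) {
        return (label >>> CHAR_BITS) != 0;
    }

    /** Unsigned shift, because the largest token types set the sign bit. */
    static int tokenType(int label) {
        return label >>> CHAR_BITS;
    }

    static int codePoint(int label) {
        return label & CHAR_MASK;
    }

    public static void main(String[] args) {
        int t = fromTokenType(2047);
        int c = fromCodePoint(0x10FFFF);
        System.out.println(isTokenType(t) + " " + tokenType(t));
        System.out.println(isTokenType(c) + " " + Integer.toHexString(codePoint(c)));
    }
}
```

The value-range test in isTokenType is what keeps the "print a token type by its value range" convenience; the cost is the 11-bit ceiling on token types.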
I want to do Unicode "right" this time. Anybody have a strong opinion
about the new supplemental (beyond 16-bit Unicode) char values and/or
whether 2048 is a serious token type limitation?
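For anyone who has not run into the supplemental values yet, the following small sketch shows how one of them looks from Java using the int-based Character and String methods (available as of Java 5). The class name SupplementaryDemo and the helper countCodePoints are invented for this example; U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient code point above 16 bits.

```java
// Illustrative sketch: a supplementary code point occupies two chars
// (a surrogate pair) in a Java String but is a single int code point.
final class SupplementaryDemo {
    /** Counts code points by stepping one or two chars at a time. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF, beyond the 16-bit range
        String s = new String(Character.toChars(clef));

        System.out.println(Character.isSupplementaryCodePoint(clef)); // true
        System.out.println(s.length());                               // 2 chars
        System.out.println(countCodePoints(s));                       // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0)));    // 1d11e
    }
}
```

An analysis that works on int values throughout only needs this kind of decoding at the input boundary, where char arrays are turned into code points.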
The new system will be cool. You'll be able to use
Character.UnicodeBlock stuff such as vocabulary=BENGALI;
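As a guess at what sits behind an option like vocabulary=BENGALI, here is a sketch of a membership test built on Character.UnicodeBlock. How the grammar option would actually map to a block is not settled here; the class VocabularyCheck and the helper inVocabulary are hypothetical names, and only the Character.UnicodeBlock calls are real JDK API.

```java
// Hypothetical sketch: testing whether a code point belongs to the block
// named by a vocabulary option. Not ANTLR code.
final class VocabularyCheck {
    /** True when codePoint is valid and lies in the given Unicode block. */
    static boolean inVocabulary(int codePoint, Character.UnicodeBlock vocabulary) {
        return Character.isValidCodePoint(codePoint)
            && Character.UnicodeBlock.of(codePoint) == vocabulary;
    }

    public static void main(String[] args) {
        // Look the block up by name, as a grammar option value might be.
        Character.UnicodeBlock vocab = Character.UnicodeBlock.forName("BENGALI");

        int bengaliKa = 0x0995; // BENGALI LETTER KA
        System.out.println(inVocabulary(bengaliKa, vocab)); // true
        System.out.println(inVocabulary('A', vocab));       // false
    }
}
```

Because UnicodeBlock.of takes an int, the same check works unchanged for blocks that live above the 16-bit range.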
Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing