[antlr-interest] unicode 16bit versus new 21bit stuff

Terence Parr parrt at cs.usfca.edu
Fri Jun 18 17:49:57 PDT 2004


Gang,

I thought I was going to be able to get away with 16-bit unicode values 
as Java seems to encode the "supplemental" crud via UTF-16 in char 
arrays / strings.  But now I see in Character that they are adding 
methods with int rather than char arguments to handle the beyond-16-bit stuff.
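
For anyone who hasn't looked at the int-based methods yet, here is a 
minimal sketch of the difference between chars and code points.  The 
class and helper names are made up for illustration; the calls 
themselves (String.codePointAt, String.codePointCount, 
Character.isSupplementaryCodePoint, Character.charCount) are the int 
flavored additions to the JDK.

```java
// Sketch only: CodePointDemo and its helpers are hypothetical names.
final class CodePointDemo {
    // Count Unicode code points rather than UTF-16 code units (chars).
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Code point starting at char index i; joins a surrogate pair if one starts there.
    static int codePointAt(String s, int i) {
        return s.codePointAt(i);
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF written as its UTF-16 surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println("chars:       " + clef.length());
        System.out.println("code points: " + countCodePoints(clef));
        System.out.println("value:       0x" + Integer.toHexString(codePointAt(clef, 0)));
        System.out.println("supplementary? " + Character.isSupplementaryCodePoint(0x1D11E));
        System.out.println("chars needed:  " + Character.charCount(0x1D11E));
    }
}
```

The point is that one supplementary character takes two chars in a 
String but is a single int once you read it as a code point.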

My analysis algorithms use pure int so there is no trouble with that; 
however, I do encode token types in the upper 16 bits of a 32-bit int and 
have all chars in the lower 16 bits.  This is purely a programming 
convenience, as I know how to print out a token type by its value 
range.  I don't want to go to 64-bit ints as most CPUs are still 32 bits 
natively.  If I use 21-bit unicode values, that leaves 32 - 21 = 11 bits, 
or 2^11 = 2048 token type values, which makes me a bit nervous.
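
To make the bit budget concrete, here is a rough sketch of one way a 
21/11 split of a 32-bit int could look.  This is an assumption about 
the layout for illustration, not the actual ANTLR code; PackedSymbol, 
pack, tokenType and codePoint are hypothetical names.

```java
// Sketch only: a hypothetical 21/11 bit split of a 32-bit int.
final class PackedSymbol {
    static final int CHAR_BITS = 21;                           // covers U+0000..U+10FFFF
    static final int CHAR_MASK = (1 << CHAR_BITS) - 1;         // 0x1FFFFF
    static final int MAX_TOKEN_TYPES = 1 << (32 - CHAR_BITS);  // 2^11 = 2048

    // Put the token type in the upper 11 bits and the code point in the lower 21.
    static int pack(int tokenType, int codePoint) {
        if (tokenType < 0 || tokenType >= MAX_TOKEN_TYPES) {
            throw new IllegalArgumentException("token type out of range: " + tokenType);
        }
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return (tokenType << CHAR_BITS) | codePoint;
    }

    // Unsigned shift so token types that set the sign bit still come back positive.
    static int tokenType(int packed) {
        return packed >>> CHAR_BITS;
    }

    static int codePoint(int packed) {
        return packed & CHAR_MASK;
    }

    public static void main(String[] args) {
        int packed = pack(2047, 0x10FFFF);
        System.out.println(tokenType(packed) + " " + Integer.toHexString(codePoint(packed)));
    }
}
```

Anything past token type 2047 simply doesn't fit without going to a 
wider integer, which is the limit in question.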

I want to do unicode "right" this time.  Anybody have a strong opinion 
about the new supplemental (beyond 16bit unicode) char values and/or 
whether 2048 is a serious token type limitation?

The new system will be cool.  You'll be able to use 
Character.UnicodeBlock stuff such as vocabulary=BENGALI;
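
On the JDK side, a block membership test is short.  The sketch below 
only shows the Character.UnicodeBlock lookup; how a vocabulary option 
would map onto it is not decided here, and VocabularyDemo / inBlock are 
hypothetical names.

```java
// Sketch only: checks whether a code point belongs to a given Unicode block.
final class VocabularyDemo {
    // UnicodeBlock.of(int) returns null for code points outside any block,
    // so the reference comparison is safe.
    static boolean inBlock(int codePoint, Character.UnicodeBlock block) {
        return Character.UnicodeBlock.of(codePoint) == block;
    }

    public static void main(String[] args) {
        // U+0985 is BENGALI LETTER A.
        System.out.println(inBlock(0x0985, Character.UnicodeBlock.BENGALI));
        System.out.println(inBlock('A', Character.UnicodeBlock.BENGALI));
    }
}
```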

Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing





 