[antlr-interest] unicode 16bit versus new 21bit stuff

Mark Lentczner markl at glyphic.com
Sat Jun 19 15:36:46 PDT 2004


Seems to me that you can still encode chars and tokens in the same 
32-bit int:
	any value <= 0x10FFFF is a Unicode code point
	any value >  0x10FFFF is a Token type

Or am I missing something?
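
To make the split concrete, here is a rough sketch in Java. The class 
and method names (SymbolCode, fromTokenType, and so on) are made up for 
illustration and are not part of ANTLR; the only library piece used is 
Character.MAX_CODE_POINT, which is 0x10FFFF.

```java
// Hypothetical helper, not ANTLR API: packs Unicode code points and
// token types into one non-negative 32-bit int.
final class SymbolCode {
    // Highest Unicode code point, 0x10FFFF.
    static final int MAX_CODE_POINT = Character.MAX_CODE_POINT;
    // First value reserved for token types, 0x110000.
    static final int TOKEN_BASE = MAX_CODE_POINT + 1;

    private SymbolCode() {}

    // True if the symbol is a Unicode code point.
    static boolean isChar(int symbol) {
        return symbol >= 0 && symbol <= MAX_CODE_POINT;
    }

    // True if the symbol is an encoded token type.
    static boolean isToken(int symbol) {
        return symbol > MAX_CODE_POINT;
    }

    // Shift a zero-based token type above the Unicode range.
    static int fromTokenType(int tokenType) {
        if (tokenType < 0 || tokenType > Integer.MAX_VALUE - TOKEN_BASE) {
            throw new IllegalArgumentException("token type out of range: " + tokenType);
        }
        return TOKEN_BASE + tokenType;
    }

    // Recover the zero-based token type from an encoded symbol.
    static int toTokenType(int symbol) {
        if (!isToken(symbol)) {
            throw new IllegalArgumentException("not a token symbol: " + symbol);
        }
        return symbol - TOKEN_BASE;
    }
}
```

Even with a signed int this leaves roughly two billion values for 
token types, so the Unicode range costs very little of the space.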

As for doing Unicode "right" - Yes, you have to handle the full 21-bit 
range (code points up to 0x10FFFF) - it's not just the stuff people 
make fun of: There are real languages whose characters got stuck above 
16 bits.

As for 64-bit ints: My code is going to have to run on legions of aging 
32-bit hardware for years to come - I'd avoid 64-bit ints if you can.


> The new system will be cool.  You'll be able to use  
> Character.UnicodeBlock stuff such as vocabulary=BENGALI;
I doubt this will be useful to anyone.  You should check if anyone 
would use it.  The Unicode blocks rarely correspond to semantically 
useful subsets for parsing.  It is highly unlikely that any grammar 
would want to have "vocabulary=BENGALI" - there would be no punctuation 
in such a language.  As for character class tests - one usually can't 
include whole blocks.  Many, if not all, blocks have characters that 
for parsing would need to be excepted out of any grammar character 
class.

Much more useful would be access to the Unicode character classes and 
character property sets like identifier_start, identifier_extend, L, Nd 
and such.
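
As a rough illustration of the difference, here is a sketch using what 
java.lang.Character already exposes. The wrapper class UnicodeClasses 
and its method names are made up for this example; I am assuming that 
Character.isUnicodeIdentifierStart and isUnicodeIdentifierPart are 
close enough to identifier_start and identifier_extend, and that 
Character.getType covers general categories such as Nd.

```java
// Hypothetical wrapper, not ANTLR API: property tests versus a block test.
final class UnicodeClasses {
    private UnicodeClasses() {}

    // Roughly identifier_start.
    static boolean isIdentifierStart(int cp) {
        return Character.isUnicodeIdentifierStart(cp);
    }

    // Roughly identifier_extend: anything allowed after the first char.
    static boolean isIdentifierPart(int cp) {
        return Character.isUnicodeIdentifierPart(cp);
    }

    // General category Nd (decimal digit number).
    static boolean isDecimalDigit(int cp) {
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    // Block membership, the kind of test vocabulary=BENGALI implies.
    static boolean inBengaliBlock(int cp) {
        return Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.BENGALI;
    }

    public static void main(String[] args) {
        int ka = 0x0995;    // BENGALI LETTER KA
        int zero = 0x09E6;  // BENGALI DIGIT ZERO
        int danda = 0x0964; // DEVANAGARI DANDA, also used to end Bengali sentences
        System.out.println(isIdentifierStart(ka) + " " + inBengaliBlock(ka));
        System.out.println(isDecimalDigit(zero) + " " + inBengaliBlock(zero));
        System.out.println(inBengaliBlock(danda) + " " + inBengaliBlock('.'));
    }
}
```

The danda and the ASCII period both fall outside the Bengali block, 
which is the punctuation problem in miniature; the property tests work 
the same way for every script.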

	- Mark



 