[antlr-interest] unicode 16bit versus new 21bit stuff
Mark Lentczner
markl at glyphic.com
Sat Jun 19 15:36:46 PDT 2004
Seems to me that you can still encode chars and tokens in the same 32
bit int:
any value <= 0x10FFFF is Unicode
any value > 0x10FFFF is a Token type
Or am I missing something?
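A minimal sketch of that split, assuming token types are small non-negative integers that get shifted up past the Unicode range. The class name SymbolCode and its methods are made up for illustration and are not part of ANTLR:

```java
// Hypothetical helper, not ANTLR API: packs characters and token types
// into one 32-bit int, split at the top of the Unicode range.
final class SymbolCode {
    // Highest valid Unicode code point, 0x10FFFF.
    static final int MAX_CODE_POINT = Character.MAX_CODE_POINT;

    // Values 0..0x10FFFF are plain Unicode code points.
    static boolean isChar(int symbol) {
        return symbol >= 0 && symbol <= MAX_CODE_POINT;
    }

    // Anything above 0x10FFFF is treated as a token type.
    static boolean isTokenType(int symbol) {
        return symbol > MAX_CODE_POINT;
    }

    // Shift a zero-based token type into the range above Unicode.
    static int encodeTokenType(int tokenType) {
        if (tokenType < 0 || tokenType > Integer.MAX_VALUE - MAX_CODE_POINT - 1) {
            throw new IllegalArgumentException("token type out of range: " + tokenType);
        }
        return MAX_CODE_POINT + 1 + tokenType;
    }

    // Recover the zero-based token type from an encoded symbol.
    static int decodeTokenType(int symbol) {
        if (!isTokenType(symbol)) {
            throw new IllegalArgumentException("not a token type: " + symbol);
        }
        return symbol - MAX_CODE_POINT - 1;
    }
}
```

That still leaves room for roughly two billion token types in a signed 32-bit int, so no 64-bit values are needed.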
As for doing Unicode "right" - Yes, you have to do all 2^21 characters
- it's not just the stuff people make fun of: There are real languages
whose characters ended up stuck above 16 bits.
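To illustrate what "above 16 bits" means in practice, here is a small sketch, assuming a Java runtime that has the code point methods on String and Character (Java 5 and later). The class name CodePoints is made up for illustration:

```java
// Hypothetical helper: walks a UTF-16 string by code point rather than
// by 16-bit char, so characters above 0xFFFF come out as a single int.
final class CodePoints {
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // joins a surrogate pair if present
            out[n++] = cp;
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return out;
    }
}
```

A lexer that reads 16-bit chars one at a time would see such a character as two separate surrogates; reading by code point avoids that.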
As for 64-bit ints: My code is going to have to run on legions of aging
32-bit hardware for years to come - I'd avoid 64-bit ints if you can.
> The new system will be cool. You'll be able to use
> Character.UnicodeBlock stuff such as vocabulary=BENGALI;
I doubt this will be useful to anyone. You should check if anyone
would use it. The Unicode blocks rarely correspond to semantically
useful subsets for parsing. It is highly unlikely that any grammar
would want to have "vocabulary=BENGALI" - there would be no punctuation
in such a language. As for character class tests - one usually can't
include whole blocks. Many, if not all, blocks have characters that
for parsing would need to be excepted out of any grammar character
class.
Much more useful would be access to the Unicode character classes and
character property sets like identifier_start, identifier_extend, L, Nd
and such.
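As a rough sketch of what property-based tests look like, assuming the code point overloads on java.lang.Character (Java 5 and later). The class name CharClasses and its method names are made up for illustration; this is not how ANTLR exposes anything:

```java
// Hypothetical helper: character class tests driven by Unicode
// properties instead of by block membership.
final class CharClasses {
    // Roughly identifier_start: may begin a Unicode identifier.
    static boolean isIdentifierStart(int cp) {
        return Character.isUnicodeIdentifierStart(cp);
    }

    // Roughly identifier_extend: may continue a Unicode identifier.
    static boolean isIdentifierPart(int cp) {
        return Character.isUnicodeIdentifierPart(cp);
    }

    // General category L (all letter subcategories).
    static boolean isLetter(int cp) {
        return Character.isLetter(cp);
    }

    // General category Nd (decimal digit number).
    static boolean isDecimalDigit(int cp) {
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }
}
```

For example, isDecimalDigit(0x09E6) accepts BENGALI DIGIT ZERO without the grammar naming the Bengali block at all, and the same test accepts ASCII '0' through '9'.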
- Mark