[antlr-interest] unicode 16bit versus new 21bit stuff
Terence Parr
parrt at cs.usfca.edu
Sat Jun 19 16:00:08 PDT 2004
On Jun 19, 2004, at 3:36 PM, Mark Lentczner wrote:
> Seems to me that you can still encode chars and tokens in the same 32
> bit int:
> any value <= 0x10FFFF is Unicode
> any value > 0x10FFFF is a Token type
>
> Or am I missing something?
Heh, you're right. I was focused on having only 11 bits left, but if I
treat it as one 32-bit int rather than 2 smaller ints, then the values
work out great! We have 0x10FFFF+1 .. 0xFFFFFFFF to mess with. That's
um...lots. ;)
Thanks!
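A minimal sketch of that split, in Java. The class, method, and constant names are made up for illustration and are not ANTLR code. Since Java's int is signed, this sketch only uses the non-negative half and leaves negative values unassigned:

```java
// Hypothetical helper, not part of ANTLR: one int holds either a
// Unicode code point or a token type.
final class SymbolSpace {
    /** Highest valid Unicode code point. */
    static final int MAX_CODE_POINT = 0x10FFFF;
    /** First int value handed out to token types (assumption of this sketch). */
    static final int TOKEN_BASE = MAX_CODE_POINT + 1;

    /** True if the value is in the Unicode range 0..0x10FFFF. */
    static boolean isChar(int symbol) {
        return symbol >= 0 && symbol <= MAX_CODE_POINT;
    }

    /** True if the value lies above the Unicode range. */
    static boolean isTokenType(int symbol) {
        return symbol > MAX_CODE_POINT;
    }

    /** Maps a zero-based token type into the space above Unicode. */
    static int encodeTokenType(int type) {
        if (type < 0 || type > Integer.MAX_VALUE - TOKEN_BASE) {
            throw new IllegalArgumentException("token type out of range: " + type);
        }
        return TOKEN_BASE + type;
    }

    /** Recovers the zero-based token type from an encoded value. */
    static int decodeTokenType(int symbol) {
        if (!isTokenType(symbol)) {
            throw new IllegalArgumentException("not a token type: " + symbol);
        }
        return symbol - TOKEN_BASE;
    }
}
```

A single comparison against 0x10FFFF is enough to tell the two kinds of value apart, so no bits need to be reserved as a tag.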
> As for doing Unicode "right" - Yes, you have to do all 2^21 characters
> - it's not just the stuff people make fun of: There are real languages
> whose characters got stuck above 16 bits.
Roger that.
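For the record, here is roughly what "above 16 bits" means for Java code. This is only a sketch with a made-up class name. It relies on the code point methods that java.lang.String and java.lang.Character have had since Java 5, and uses U+20000 (a CJK Extension B ideograph) as the sample character:

```java
// Hypothetical demo class: walks a String by code point, not by 16-bit char.
final class CodePointWalk {
    /** Returns the code points of s, joining surrogate pairs into one value. */
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // reads one or two chars
            out[n++] = cp;
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return out;
    }

    /** Builds a String from a single code point, using a surrogate pair if needed. */
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }
}
```

A lexer that reads one 16-bit char at a time would see such a character as two separate symbols, which is why the input symbol has to be a full code point.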
> As for 64-bit ints: My code is going to have to run on legions of aging
> 32-bit hardware for years to come - I'd avoid 64-bit ints if you can.
Ok, makes sense. Sounds like we're good to go, though.
>> The new system will be cool. You'll be able to use
>> Character.UnicodeBlock stuff such as vocabulary=BENGALI;
> I doubt this will be useful to anyone. You should check if anyone
> would use it. The Unicode blocks rarely correspond to semantically
> useful subsets for parsing. It is highly unlikely that any grammar
> would want to have "vocabulary=BENGALI" - there would be no punctuation
> in such a language. As for character class tests - one usually can't
> include whole blocks. Many, if not all, blocks have characters that
> for parsing would need to be excepted out of any grammar character
> class.
Ah. Well, I was going to allow sets like DIGIT | BENGALI
but I'm not sure how to handle the DIGIT "block" as it's context
sensitive, right? I.e., wouldn't we want DIGIT to only allow BENGALI
digits?
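One way to read DIGIT | BENGALI as "only Bengali digits" is to intersect the block with the digit property. This is a sketch of that reading in plain Java, not a statement of how ANTLR will do it. The class and method names are invented, while Character.UnicodeBlock.of(int) and Character.isDigit(int) are the standard library calls:

```java
// Hypothetical helper: "digit" restricted to one Unicode block.
final class BlockDigits {
    /** True if cp is a digit character that lives in the given block. */
    static boolean isDigitIn(Character.UnicodeBlock block, int cp) {
        return Character.isDigit(cp) && Character.UnicodeBlock.of(cp) == block;
    }

    /** True if cp is one of the Bengali digits U+09E6..U+09EF. */
    static boolean isBengaliDigit(int cp) {
        return isDigitIn(Character.UnicodeBlock.BENGALI, cp);
    }
}
```

With this reading, an ASCII '5' is rejected and so is a Bengali letter such as U+0995, while U+09E6 (BENGALI DIGIT ZERO) is accepted.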
> Much more useful would be access to the Unicode character classes and
> character property sets like identifier_start, identifier_extend, L, Nd
> and such.
Yep, all that will be allowed.
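As a rough idea of what those tests look like on the Java side, here is a sketch built on java.lang.Character. The wrapper names are invented, and mapping identifier_start / identifier_extend onto isUnicodeIdentifierStart / isUnicodeIdentifierPart is my assumption about the closest built-in equivalents, not an exact match to the Unicode property definitions:

```java
// Hypothetical wrappers over java.lang.Character property tests.
final class CharProps {
    /** General category L (any kind of letter). */
    static boolean isL(int cp) {
        return Character.isLetter(cp);
    }

    /** General category Nd (decimal digit number). */
    static boolean isNd(int cp) {
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    /** Approximation of identifier_start (assumption, see note above). */
    static boolean isIdentifierStart(int cp) {
        return Character.isUnicodeIdentifierStart(cp);
    }

    /** Approximation of identifier_extend (assumption, see note above). */
    static boolean isIdentifierExtend(int cp) {
        return Character.isUnicodeIdentifierPart(cp);
    }
}
```

These tests cut across blocks: isNd accepts both ASCII '7' and U+09E6, which is the behavior a grammar character class usually wants.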
Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing