[antlr-interest] unicode 16bit versus new 21bit stuff

Terence Parr parrt at cs.usfca.edu
Sat Jun 19 16:00:08 PDT 2004


On Jun 19, 2004, at 3:36 PM, Mark Lentczner wrote:

> Seems to me that you can still encode chars and tokens in the same 32
> bit int:
> 	any value <= 0x10FFFF is Unicode
> 	any value >  0x10FFFF is a Token type
>
> Or am I missing something?

Heh, you're right.  I was focused on having only 11 bits left over, but
if I treat it as one 32-bit int rather than 2 smaller ints, then the
values work out great!  We have 0x10FFFF+1 .. 0xFFFFFFFF to mess with.
That's um...lots. ;)  Thanks!
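
A minimal Java sketch of the split described above, for illustration only.
The class and method names (CharOrToken, isChar, isToken, encodeToken,
decodeToken) are hypothetical and not part of ANTLR. Because Java ints are
signed, the sketch assumes the upper range is compared as unsigned so that
values up to 0xFFFFFFFF still count as token types.

```java
final class CharOrToken {
    // Highest Unicode code point, 0x10FFFF.
    static final int MAX_CHAR = Character.MAX_CODE_POINT;
    // Hypothetical: first value reserved for token types.
    static final int MIN_TOKEN = MAX_CHAR + 1;

    // Any value <= 0x10FFFF is a Unicode character.
    static boolean isChar(int v) {
        return v >= 0 && v <= MAX_CHAR;
    }

    // Any value > 0x10FFFF (as an unsigned 32-bit quantity) is a token type.
    static boolean isToken(int v) {
        return Integer.compareUnsigned(v, MAX_CHAR) > 0;
    }

    // Map a zero-based token index into the token range (assumed scheme).
    static int encodeToken(int tokenIndex) {
        if (tokenIndex < 0) {
            throw new IllegalArgumentException("token index must be >= 0");
        }
        return MIN_TOKEN + tokenIndex;
    }

    // Inverse of encodeToken.
    static int decodeToken(int v) {
        if (!isToken(v)) {
            throw new IllegalArgumentException("not a token value");
        }
        return v - MIN_TOKEN;
    }

    public static void main(String[] args) {
        System.out.println(isChar('A'));
        System.out.println(isToken(encodeToken(3)));
    }
}
```

Every 32-bit value falls into exactly one of the two ranges, so a single
int field can carry either kind of symbol without a separate tag.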

> As for doing Unicode "right" - Yes, you have to do all 2^21 characters
> - it's not just the stuff people make fun of: There are real languages
> whose characters ended up stuck above 16 bits.

Roger that.
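
To illustrate what "above 16 bits" means in practice, here is a small Java
sketch that walks a string by code point instead of by 16-bit char. It
assumes Java 5 or later (String.codePointAt and Character.charCount); the
class name CodePointWalk is made up for this example.

```java
final class CodePointWalk {
    // Return the code points of s, combining surrogate pairs into one value.
    static int[] codePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            // Advance by 1 char for the BMP, 2 for supplementary characters.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // U+10400 lies above 16 bits and occupies two chars in a Java String.
        String s = "a" + new String(Character.toChars(0x10400));
        for (int cp : codePoints(s)) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```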

> As for 64-bit ints: My code is going to have to run on legions of aging
> 32-bit hardware for years to come - I'd avoid 64-bit ints if you can.

Ok, makes sense.  Sounds like we're good to go, though.

>> The new system will be cool.  You'll be able to use
>> Character.UnicodeBlock stuff such as vocabulary=BENGALI;
> I doubt this will be useful to anyone.  You should check if anyone
> would use it.  The Unicode blocks rarely correspond to semantically
> useful subsets for parsing.  It is highly unlikely that any grammar
> would want to have "vocabulary=BENGALI" - there would be no punctuation
> in such a language.  As for character class tests - one usually can't
> include whole blocks.  Many, if not all, blocks have characters that
> for parsing would need to be excepted out of any grammar character
> class.

Ah.  Well, I was going to allow sets like DIGIT | BENGALI

but I'm not sure how to handle the DIGIT "block" as it's context 
sensitive, right?  I.e., wouldn't we want DIGIT to only allow BENGALI 
digits?
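
A rough Java sketch of the two readings, using the existing
Character.UnicodeBlock and Character.isDigit APIs. The class and method
names are hypothetical, and this is not how ANTLR implements sets: a plain
union accepts any digit plus the whole block, while the context-sensitive
reading is really an intersection.

```java
import java.lang.Character.UnicodeBlock;

final class BlockSets {
    // True if cp lies in the BENGALI block (U+0980..U+09FF).
    static boolean inBengaliBlock(int cp) {
        return UnicodeBlock.of(cp) == UnicodeBlock.BENGALI;
    }

    // Union reading of "DIGIT | BENGALI": any digit, or anything in the block.
    static boolean digitOrBengali(int cp) {
        return Character.isDigit(cp) || inBengaliBlock(cp);
    }

    // Context-sensitive reading: only digits that belong to the block.
    static boolean bengaliDigit(int cp) {
        return Character.isDigit(cp) && inBengaliBlock(cp);
    }

    public static void main(String[] args) {
        int bengaliZero = 0x09E6; // BENGALI DIGIT ZERO
        System.out.println(digitOrBengali('7'));
        System.out.println(bengaliDigit('7'));
        System.out.println(bengaliDigit(bengaliZero));
        // ASCII punctuation such as ',' is outside the block entirely.
        System.out.println(inBengaliBlock(','));
    }
}
```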

> Much more useful would be access to the Unicode character classes and
> character property sets like identifier_start, identifier_extend, L, Nd
> and such.

Yep, all that will be allowed.
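
For reference, the JDK already exposes tests close to the properties named
above. The sketch below is only an approximation: mapping identifier_start
and identifier_extend onto Character.isUnicodeIdentifierStart and
Character.isUnicodeIdentifierPart is an assumption (the latter also accepts
ignorable characters), and the wrapper names are hypothetical.

```java
final class UnicodeProps {
    // Approximates identifier_start.
    static boolean isIdentifierStart(int cp) {
        return Character.isUnicodeIdentifierStart(cp);
    }

    // Approximates identifier_extend (assumed mapping, see note above).
    static boolean isIdentifierExtend(int cp) {
        return Character.isUnicodeIdentifierPart(cp);
    }

    // General category L: Lu, Ll, Lt, Lm, or Lo.
    static boolean isL(int cp) {
        return Character.isLetter(cp);
    }

    // General category Nd: decimal digit number.
    static boolean isNd(int cp) {
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifierStart('a'));
        System.out.println(isIdentifierExtend('1'));
        System.out.println(isNd(0x09E6));   // BENGALI DIGIT ZERO
        System.out.println(isL(0x10400));   // a letter above 16 bits
    }
}
```

All four methods take an int code point, so they work for characters above
16 bits as well as for the Basic Multilingual Plane.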

Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing





 