[antlr-interest] unicode strings using supplemental char range

Fri Jun 25 18:39:19 PDT 2004

> -----Original Message-----
> From: Terence Parr [mailto:parrt at cs.usfca.edu] 
> Sent: Thursday, June 24, 2004 7:27 PM
> To: antlr-interest at yahoogroups.com
> Subject: Re: [antlr-interest] unicode strings using supplemental char range
> 
> 
> On Jun 24, 2004, at 3:20 PM, Mark Lentczner wrote:
> >> Actually, I just had an idea.  First, thanks to your help, I know
> >> that UTF-16 encoded in a string is unambiguously UTF-16.  Now, the
> >> only question is, how do we match a 21-bit char against it?  What
> >> if we just specified that the input must be UTF-16 also?  Then,
> >> ANTLR can pretend everything is 16 bits, right?
> > Well, as you pointed out, this is like my hack of lexing UTF-8 for
> > my parsers in C++.  Operative word is HACK.  The other problem is
> > that this will fall apart as soon as you want to put in the other
> > cool Unicode class based checks (isIdentifierStart, isLowerCase,
> > etc...).
> 
> Well, I was going to say that UTF-16 is the way I'll leave it until
> you said this last thing.  isLowerCase, for example, simply won't
> work if we have UTF-16 strings.  I'll have to take your word for it
> that real languages will use codes above 16 bits, btw. ;)
> 

There is nothing that I am aware of preventing an implementation of
'isLowerCase' over UTF-16.  The real issue is that working with Unicode
at all is a lot of work.  That is why the ICU library is popular: it
has already done much of the tedious work.  It is unfortunate that most
of the comments on the list indicate a desire not to use such a
library, since it contains a lot of useful functionality.

There are comments in some Unicode docs indicating that a number of
Unicode users store strings as UTF-16 but use UTF-32 in APIs that
handle individual characters.  This is what ICU4C appears to do.
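
To illustrate that split (UTF-16 for strings, full code points for
per-character APIs), here is a minimal sketch.  It is not ICU or ANTLR
code: the names code_points and is_lower_case are made up for this
example, Python's standard unicodedata tables stand in for the ones ICU
ships, and treating "general category Ll" as the definition of lower
case is an assumption (real libraries may also consult other
properties).

```python
import unicodedata


def code_points(units):
    """Yield code points from a sequence of UTF-16 code units (ints)."""
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Surrogate pair: combine the two 10-bit halves into one
            # supplementary code point (U+10000 and above).
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through as-is.
            yield unit
            i += 1


def is_lower_case(cp):
    """Class check on a whole code point, not on a 16-bit unit."""
    return unicodedata.category(chr(cp)) == "Ll"


# 'a' followed by U+1D41A MATHEMATICAL BOLD SMALL A as a surrogate pair.
utf16_units = [0x0061, 0xD835, 0xDC1A]
flags = [is_lower_case(cp) for cp in code_points(utf16_units)]
```

The check only goes wrong if it is applied to the individual 16-bit
units: 0xD835 on its own is a surrogate, not a letter.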

Regarding which languages reside outside 16 bits, the Unicode FAQ
(http://www.unicode.org/faq/utf_bom.html#34) says, under the UTF-16
section:

 "What is UTF-16?
  A: UTF-16 uses a single 16-bit code unit to encode the most common
  63K characters, and a pair of 16-bit code units, called surrogates,
  to encode the 1M less commonly used characters in Unicode."

If you want a better idea of what falls into the various Unicode
ranges, check this out: http://www.unicode.org/charts/
An example of a block that requires surrogates is the
"CJK Compatibility Ideographs Supplement".  It looks like a number of
symbols (math/music) require surrogates as well.
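
The FAQ's round numbers and the examples above can be checked with a
little arithmetic.  The following is only a sketch: to_utf16_units is a
name invented for this example, and reading "63K" as the 65,536 16-bit
values minus the 2,048 surrogate values is my interpretation of the FAQ
wording, not something it spells out.

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units (one or two ints) for a code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        return [cp]                      # fits in a single 16-bit unit
    offset = cp - 0x10000                # 20 bits, split 10 + 10
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


# 1,024 high surrogates times 1,024 low surrogates: the FAQ's "1M".
pair_encoded = 1024 * 1024               # 1048576
# 16-bit values left over once the surrogates are set aside: "63K".
single_unit = 0x10000 - 2 * 1024         # 63488

samples = {
    "LATIN CAPITAL LETTER A": 0x0041,
    "MUSICAL SYMBOL G CLEF": 0x1D11E,
    "MATHEMATICAL BOLD SMALL A": 0x1D41A,
    "start of CJK Compatibility Ideographs Supplement": 0x2F800,
}
for name, cp in samples.items():
    units = " ".join("%04X" % u for u in to_utf16_units(cp))
    print("U+%04X %s -> %s" % (cp, name, units))
```

Everything at U+10000 and above comes out as two units, which is why
the music, math and CJK supplement characters need surrogate handling.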

FYI - I am a very happy user of Antlr, but I am concerned about the
future Unicode support in C++ parsers created from Antlr 3.  We are
currently looking to internationalize a number of our components.  Some
of those use support libraries built on Antlr 2.6.1, Antlr 2.7.1, and
Flex/Bison, which is looking a bit messy right now.  I am really hoping
that Antlr 3 can "save the day" for me like Antlr 2 did in the past.

> > Sorry Terence, suck it up and change all Strings to UnicodeArray which
> 
> Shite.  Rats.  Argh!  That means I'm back to the days of C/C++ where
> I have to define String.  Crap.  Anybody have any idea what the speed
> hit will be for us LATIN encoded people?
> 
> > is a class wrapper around int[].  Better yet, make it a protocol,
> > and then supply implementations that scan over String, over int[],
> > and perhaps over UTF-8 encoded byte[]...
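
The "make it a protocol" idea quoted above could look roughly like the
sketch below.  None of these names (CodePointSource, StrSource,
IntArraySource, Utf8Source, count_supplementary) come from ANTLR or
from the thread; they are placeholders.  It is also written in Python,
whose str is already a sequence of code points, so the String case is
simpler here than it would be over Java's 16-bit chars.

```python
from typing import Iterator, Protocol


class CodePointSource(Protocol):
    """Anything a lexer could pull whole code points from."""

    def code_points(self) -> Iterator[int]: ...


class StrSource:
    def __init__(self, text: str) -> None:
        self.text = text

    def code_points(self) -> Iterator[int]:
        return (ord(ch) for ch in self.text)


class IntArraySource:
    """Wrapper around an array of ints, one code point per element."""

    def __init__(self, values) -> None:
        self.values = values

    def code_points(self) -> Iterator[int]:
        return iter(self.values)


class Utf8Source:
    def __init__(self, data: bytes) -> None:
        self.data = data

    def code_points(self) -> Iterator[int]:
        # Decodes everything up front; a real scanner would stream.
        return (ord(ch) for ch in self.data.decode("utf-8"))


def count_supplementary(source: CodePointSource) -> int:
    """Example consumer: it never sees which encoding is underneath."""
    return sum(1 for cp in source.code_points() if cp > 0xFFFF)
```

The consumer is written once against the protocol, so the cost of the
wider representation is only paid by inputs that actually need it.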

We use UTF-16 exclusively in the middle tier of our application(s).
We chose UTF-16 because a number of other libraries and tools use
UTF-16 (or UCS-2) internally, so by using it as well we had fewer
encoding changes to deal with, less data movement in memory, and
somewhat easier memory management.  Unfortunately, I don't have any
benchmarks to indicate a speed differential.  Most of our application
logic *seems* to make a larger difference than the use of 16-bit
character units.

> 
> ;)
> 
> Thanks a bunch for the clarifications...
> 
> Ter
> --
> CS Professor & Grad Director, University of San Francisco 
> Creator, ANTLR Parser Generator, http://www.antlr.org 
> Cofounder, http://www.jguru.com Cofounder, 
> http://www.knowspam.net enjoy email again!
> Cofounder, http://www.peerscope.com pure link sharing

Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/