[antlr-interest] Unicode

Case, Scott A Scott.Case at ca.com
Tue Jan 31 21:52:47 PST 2006


>> There are many horrible things about Windows, but this isn't one of
>> them.
>> 
>> http://www.unicode.org/faq/utf_bom.html#22

> Well I know it's standardised, it's just not used on Unix AFAIK. And I
> think that's a good decision, with Unicode there is really no reason to
> have a mixed-encoding system.

Here are my 2 cents regarding Unicode support.

There is a lot of existing data that isn't encoded in Unicode, so even
though Unicode is the primary encoding we process data in (for the
application area I handle), we must still be able to recognize and
convert data from other encodings, and produce other encodings for
APIs/systems that don't support Unicode.  So multi/mixed encoding isn't
going away anytime soon - Unicode simply makes it easier to write code
with less dependence on knowing the details of numerous other encodings.
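
Just to illustrate that conversion step, here is a rough sketch of
pulling a legacy-encoded buffer into UTF-16 with ICU4C (my own example,
not anything ANTLR-specific; the "ibm-1047" converter name is only a
placeholder for whatever legacy encoding is involved):

    #include <unicode/ucnv.h>

    /* Sketch: convert a legacy-encoded (here: EBCDIC ibm-1047) byte
       buffer into UTF-16 UChars.  Error handling and buffer sizing are
       abbreviated for brevity. */
    static int32_t legacy_to_utf16(const char *src, int32_t srcLen,
                                   UChar *dest, int32_t destCap)
    {
        UErrorCode status = U_ZERO_ERROR;
        int32_t len;
        UConverter *cnv = ucnv_open("ibm-1047", &status);
        if (U_FAILURE(status)) return -1;
        len = ucnv_toUChars(cnv, dest, destCap, src, srcLen, &status);
        ucnv_close(cnv);
        return U_FAILURE(status) ? -1 : len;
    }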

We use a BOM, and it has simplified various things.  If all (internal &
3rd party) libraries were able to guess encodings reliably, then we
wouldn't need a BOM.  Since that isn't the case, a BOM lets us reuse
data across platforms many times without needing to convert the
encoding (mainly the endianness of UTF-16 in our case).
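
For what it's worth, sniffing a BOM at the front of a buffer is cheap;
a minimal sketch in C (the function name is mine, not from any library,
and UTF-32 BOMs are ignored for brevity):

    #include <stddef.h>
    #include <string.h>

    /* Return an encoding label based on the BOM at the start of buf,
       or NULL if no BOM is present. */
    static const char *detect_bom(const unsigned char *buf, size_t len)
    {
        if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) return "UTF-8";
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)    return "UTF-16LE";
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)    return "UTF-16BE";
        return NULL;
    }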

Also, we use ICU4C (which is a fine library) and it simplifies our life
dramatically.  Whether or not to use it in an application (in our case)
comes down to the need for a consistent representation across the
particular libraries used and the performance characteristics the
application requires.  The size of wchar varies across platforms (16
bits on Windows, typically 32 bits on Unix), and some libraries process
data using a character size different from the native wchar.  So if you
are processing little data and/or not using any libraries with a
different representation of UTF-16, ICU might be overkill.  For
data-intensive applications which require/process UTF-16 data
represented differently from the native wchar, ICU is a great choice.
The biggest pitfall is the (potential) need to convert data to the
native wchar for any required system/runtime calls.
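
To make that last pitfall concrete, here is a rough sketch (mine, using
ICU's u_strToWCS) of the extra copy you end up paying for before a
C-runtime call that wants wchar_t:

    #include <unicode/ustring.h>
    #include <wchar.h>
    #include <stdio.h>

    /* Sketch: convert an ICU UChar (always 16-bit UTF-16) string to the
       native wchar_t representation before calling into the C runtime.
       On platforms where wchar_t is 32-bit this is a real conversion;
       where it is already 16-bit UTF-16 the copy is largely wasted work. */
    static void print_via_wchar(const UChar *src, int32_t srcLen)
    {
        wchar_t buf[256];
        int32_t destLen = 0;
        UErrorCode status = U_ZERO_ERROR;
        u_strToWCS(buf, 256, &destLen, src, srcLen, &status);
        if (U_SUCCESS(status))
            wprintf(L"%ls\n", buf);
    }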

So for Antlr v3, using typedefs for the character/string representation
is fine as long as the template/runtime support can account for a
UTF-16 representation different from the native wchar.  For example,
with Antlr v3 C/C++ output, if the default runtime/templates/whatever
used C runtime string functions, it would not work for us when we
needed to parse certain UTF-16 data; we would need to re-implement the
runtime/templates/whatever using ICU so our code would work across
platforms.  We tend to deem the cost of converting the data
representation (to/from wchar) too high and make substantial attempts
to avoid it.
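
To make the idea concrete, the kind of swappable layer I have in mind
might look roughly like this; the ANTLR_UCHAR/ANTLR_STRLEN/ANTLR_STRCMP
names are purely hypothetical and not part of any ANTLR runtime:

    /* Hypothetical sketch of a swappable character/string layer.
       Define ANTLR_USE_ICU to route string primitives through ICU4C
       instead of the C runtime's wide-character functions. */
    #ifdef ANTLR_USE_ICU
      #include <unicode/ustring.h>
      typedef UChar ANTLR_UCHAR;              /* always 16-bit UTF-16 */
      #define ANTLR_STRLEN(s)   u_strlen(s)
      #define ANTLR_STRCMP(a,b) u_strcmp((a),(b))
    #else
      #include <wchar.h>
      typedef wchar_t ANTLR_UCHAR;            /* 16 or 32 bits, per platform */
      #define ANTLR_STRLEN(s)   wcslen(s)
      #define ANTLR_STRCMP(a,b) wcscmp((a),(b))
    #endif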

I'm also not sure about the use of straight binary comparisons
(mentioned in an earlier thread, I think).  As someone else mentioned,
the same Unicode text can be encoded in more than one way (precomposed
vs. decomposed forms, for example), so correct comparison requires
either normalizing the data up front or using comparison methods which
account for denormalized data internally.  I think that the second
option would be preferred for us.
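
As an illustration of that second option, ICU4C's unorm_compare()
checks two UTF-16 strings for canonical equivalence without requiring
the caller to normalize them first; a minimal sketch:

    #include <unicode/unorm.h>

    /* Sketch: returns 1 if a and b are canonically equivalent, e.g.
       precomposed U+00E9 vs. 'e' followed by combining acute U+0301.
       options = 0 means no case folding; -1 lengths mean the strings
       are NUL-terminated. */
    static int canonically_equal(const UChar *a, const UChar *b)
    {
        UErrorCode status = U_ZERO_ERROR;
        int32_t r = unorm_compare(a, -1, b, -1, 0, &status);
        return U_SUCCESS(status) && r == 0;
    }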

So in general, if character/string types and all string processing
methods were defined in templates, it seems it would allow us to tailor
the behavior to our needs.  Hmm, I was kinda verbose - maybe I should
have just stuck with this paragraph.

- Scott

p.s. Thanks to Terence and all the people who are contributing!


