[antlr-interest] Unicode

Tue Jan 31 07:49:19 PST 2006

Hi,

> > There are encoding prefixes in disk files, which I believe are standardized.

Well, on Windows boxes you have this horrible Byte Order Mark thing in
(some) text files. Unixes don't have anything like that; the general
assumption (as far as I know) is that all files match the encoding
specified by the LANG (LC_*) environment variables, and everything
else is an error.
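
To make the two conventions concrete, here is a minimal Python sketch (not
anything ANTLR does itself). It looks for a Byte Order Mark at the start of
the raw bytes and otherwise falls back to the locale's encoding. The function
name sniff_bom and its fallback parameter are made up for illustration.

```python
import codecs
import locale

# Checked in order. UTF-32 must come before UTF-16 because the
# UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, fallback=None):
    """Return (encoding, payload) with any leading BOM stripped.

    Hypothetical helper: if no BOM is found, assume the locale's
    preferred encoding (the Unix-style convention) unless the caller
    passes an explicit fallback.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    if fallback is None:
        fallback = locale.getpreferredencoding(False)
    return fallback, data
```

One caveat with this kind of sniffing: a UTF-16 LE file whose first character
is U+0000 is indistinguishable from a UTF-32 LE BOM, so the result is a guess
and not a guarantee.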

> There might be issues when you want to do a compare between a
> L"sumthin" (from say a literals table) with something you just got
> from disk.. have to match the encoding and/or have both encoded
> canonically (or what was the term in unicode) (it's been a while when
> I looked into this, but there might be a few gotchas (or I'm
> pessimistic ;) ))

The major problem is that even after you have both strings in the same
Unicode encoding (e.g. UTF-16) they might still look identical but
compare as different. E.g. a German "ä" (a-umlaut) can be expressed as
an "a" followed by the combining diaeresis character or as the single
precomposed character "ä", depending on the normalization form of the
string. Probably the same for "ij" if that example is more appealing to
you ;-)
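
A small Python sketch of the effect, using the standard unicodedata module.
The helper name same_text is made up for illustration. Note that the "ij"
ligature (U+0133) behaves a bit differently from "ä": it is only
compatibility-equivalent to "i" + "j", so it takes NFKC/NFKD to fold it,
while NFC/NFD leave it alone.

```python
import unicodedata

composed = "\u00e4"      # "ä" as one precomposed code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to one normalization form.

    Hypothetical helper; `form` is any name accepted by
    unicodedata.normalize ("NFC", "NFD", "NFKC", "NFKD").
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Same rendering, different code points: a plain comparison fails.
naive_equal = composed == decomposed

# Canonical equivalence: NFC (or NFD) is enough for "ä".
canonical_equal = same_text(composed, decomposed, "NFC")

# The "ij" ligature only matches "i" + "j" under the compatibility forms.
ij_nfc_equal = same_text("\u0133", "ij", "NFC")
ij_nfkc_equal = same_text("\u0133", "ij", "NFKC")
```

So a literals table and the text read from disk need to agree on the
normalization form as well as on the encoding before they are compared.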

When I was using ANTLR/C++ I actually got along very well by treating
everything as UTF-8 and otherwise as "magic opaque values".
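
Roughly what I mean, sketched in Python and not in the original C++ (the
names KEYWORDS, classify and tokens are made up for illustration): keep the
input and the literals table as UTF-8 bytes and never decode them. This works
because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so
splitting on ASCII delimiters can never cut a character in half.

```python
# Hypothetical literals table, keyed by the UTF-8 bytes of each keyword.
KEYWORDS = {
    "if".encode("utf-8"): "IF",
    "größe".encode("utf-8"): "SIZE",
}


def classify(token):
    """Look a token up as opaque bytes; no decoding, no normalization."""
    return KEYWORDS.get(token, "IDENT")


def tokens(source):
    """Split UTF-8 source bytes on ASCII spaces and classify each piece."""
    return [classify(t) for t in source.split(b" ") if t]
```

The catch is the normalization issue from above: the bytes only match if both
sides use the same normalization form.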

Martin