[antlr-interest] Unicode

Don Caton dcaton at shorelinesoftware.com
Tue Jan 31 08:37:39 PST 2006


Ric:

> http://www-306.ibm.com/software/globalization/icu/index.jsp
> 
> It seems to be one of the most portable ways of supporting unicode.
> Although I'm reluctant to tie (part of) the antlr runtime to 
> such a big library.

I don't think anything like this is necessary.  Let the end user determine
what the data actually represents and what encodings, if any, are present.

The only thing I'm after here is to remove the assumption in ANTLR that
characters are 8 bits wide.  If you stick with the wide character support
that's in the standard C++ library, I think that will be sufficient and
will eliminate any portability concerns.

ANTLR shouldn't worry about how something is encoded on disk or any of that;
it's up to the programmer to provide correct input and take those things
into consideration.  Just let me process 16-bit characters end-to-end and
store them in a wstring (or whatever I may choose to define as 'char_type'
and 'string_type').
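
To make that concrete, here is a rough sketch of the kind of abstraction I
have in mind.  The names CharTypes and collectWord are made up for this
example and are not anything in the ANTLR runtime today; the point is only
that nothing in the code assumes a character is 8 bits wide.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch; CharTypes and collectWord are not ANTLR API.
template <typename CharT>
struct CharTypes {
    typedef CharT                    char_type;
    typedef std::basic_string<CharT> string_type;
};

// Copies characters from 'input' starting at 'pos' until a space or the end
// of the input, without assuming anything about the width of a character.
template <typename Types>
typename Types::string_type
collectWord(const typename Types::string_type& input, std::size_t pos) {
    typedef typename Types::char_type char_type;
    typename Types::string_type word;
    while (pos < input.size() && input[pos] != char_type(' ')) {
        word.push_back(input[pos]);
        ++pos;
    }
    return word;
}
```

The same function works for CharTypes<char> and CharTypes<wchar_t>; the
caller picks the width, and the code never looks at how the data was encoded
on disk.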

> There might be issues when you want to do a compare between a 
> L"sumthin" (from say a literals table) with something you 
> just got from disk.. have to match the encoding and/or have 
> both encoded canonically (or what was the term in unicode) 
> (it's been a while when I looked into this, but there might 
> be a few gotchas (or I'm pessimistic ;) ))

It should be the end user's responsibility to compare the two strings
properly.  ANTLR should just compare them for binary equality.  Anything more
specific should be left for the end user to override in a subclass, if
necessary.
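
Something along these lines is what I mean.  LiteralMatcher and
CaseInsensitiveMatcher are hypothetical names for illustration only, and the
case-insensitive override is just one example of what an end user might
want; it relies on towlower, so its behaviour depends on the current C
locale.

```cpp
#include <cwctype>
#include <string>

// Hypothetical sketch; these classes are not part of the ANTLR runtime.
class LiteralMatcher {
public:
    virtual ~LiteralMatcher() {}

    // Default behaviour: plain binary equality, one code unit at a time.
    virtual bool matches(const std::wstring& a, const std::wstring& b) const {
        return a == b;
    }
};

// An end user who needs something more specific overrides matches().
class CaseInsensitiveMatcher : public LiteralMatcher {
public:
    virtual bool matches(const std::wstring& a, const std::wstring& b) const {
        if (a.size() != b.size()) return false;
        for (std::wstring::size_type i = 0; i < a.size(); ++i) {
            if (std::towlower(a[i]) != std::towlower(b[i])) return false;
        }
        return true;
    }
};
```

With the default matcher, a precomposed L"\u00e9" and a decomposed
L"e\u0301" compare as different, which is exactly the canonical-form issue
you mention.  Someone who cares about that would normalize in their own
override rather than have ANTLR do it for everybody.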

> I guess we'll see. What platform(s) are you working on? 

Windows and Visual Studio 2005.

> Once 
> you have something it might be an idea to do a sanity check 
> against some other os/compiler combinations to see how 
> portable things are.

If it's properly abstracted, then any other compiler/OS combination can just
define its own set of typedefs.  Or not, if it doesn't support Unicode.
By default, ANTLR would continue to use 'char', 'std::string' and
'istream', so it shouldn't create any portability or backwards compatibility
issues.

Don



