[antlr-interest] Unicode

Don Caton dcaton at shorelinesoftware.com
Tue Jan 31 06:53:55 PST 2006


> Hi Don,

> The string/wstring thing can be solved to some extent with 
> the unicode examples approach. 

The "Unicode" example isn't a Unicode program though.  There's still
hard-coded std::string, literal ANSI strings, etc.  It might lex Unicode
input, but you still end up with code that is ANSI, with UTF-8 encoded
strings.  A reasonable solution in some cases perhaps, but not a true
end-to-end Unicode system.
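
To illustrate the difference, here is a minimal sketch (not taken from the
ANTLR example; the names below are made up for illustration).  It assumes the
character U+00E9, written with explicit escapes so the source file encoding
doesn't matter.  A std::string holding UTF-8 counts bytes, while a wstring
counts wide characters:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers, for illustration only.

// Number of char units in a narrow string (bytes, if it holds UTF-8).
std::size_t narrow_units(const std::string& s) { return s.size(); }

// Number of wchar_t units in a wide string.
std::size_t wide_units(const std::wstring& s) { return s.size(); }

// U+00E9 (e with acute accent), spelled with escapes on purpose.
const std::string  utf8_e_acute = "\xC3\xA9";  // two UTF-8 bytes
const std::wstring wide_e_acute = L"\u00E9";   // one wchar_t
```

Anything that indexes or measures the narrow string is working on bytes, not
characters, which is why the UTF-8-in-std::string approach only goes so far.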

For a true Unicode program you need to use wstring, wcout, literal Unicode
strings, and so on, all of which can be abstracted with a handful of
typedefs.  The Unicode example does a little of that by using the
'char_type' typedef.  That just needs to be continued throughout all the
code, including the generated code.

> For general unicode support 
> it's probably easiest (and most portable) to build on top of ICU.

ICU?

> I don't think any rationale, it's old code ;) Although the 
> standard tools/libs provided for C++ to do unicode-ish things 
> seem to be not very standard across various compiler 
> implementations, so that may have been a reason not to 
> bother. I came to be C++ maintainer at a later point in time 
> so no idea.

Understood.

> There's not really a standard way on how unicode is handled 
> in C++ e.g. dealing with encodings from files on disk. I'm 
> not even sure if it is standardized how a wide string 
> constant is encoded (I thought it was implementation 
> dependent but not 100% sure from the top of my head).

There are encoding prefixes (byte order marks) at the start of disk files,
which I believe are standardized.  But that could be left to the end user;
you just have to ensure that you're providing valid Unicode input.
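
As a rough sketch of what checking for such a prefix could look like, assuming
the prefixes in question are the Unicode byte order marks (the Encoding enum
and detect_bom function are hypothetical names, not part of ANTLR):

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type, for illustration only.
enum Encoding {
    ENC_UNKNOWN, ENC_UTF8, ENC_UTF16_LE, ENC_UTF16_BE, ENC_UTF32_LE, ENC_UTF32_BE
};

// True if 'data' begins with the n bytes at 'bom'.
static bool starts_with(const std::string& data, const char* bom, std::size_t n)
{
    return data.size() >= n && data.compare(0, n, bom, n) == 0;
}

// Inspects the first bytes of a file (passed in as 'head') for a byte order mark.
// The UTF-32 checks come first because the UTF-32 LE mark begins with the
// same two bytes as the UTF-16 LE mark.
Encoding detect_bom(const std::string& head)
{
    if (starts_with(head, "\xFF\xFE\x00\x00", 4)) return ENC_UTF32_LE;
    if (starts_with(head, "\x00\x00\xFE\xFF", 4)) return ENC_UTF32_BE;
    if (starts_with(head, "\xEF\xBB\xBF", 3))     return ENC_UTF8;
    if (starts_with(head, "\xFF\xFE", 2))         return ENC_UTF16_LE;
    if (starts_with(head, "\xFE\xFF", 2))         return ENC_UTF16_BE;
    return ENC_UNKNOWN;  // no mark present; the caller has to decide
}
```

A file with no mark at all comes back as ENC_UNKNOWN, so this is a hint rather
than a guarantee, and choosing the encoding in that case is still up to the
end user.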

All of this could be encapsulated in #defines and typedefs though, and any
compiler that supports the standard C++ library should be able to handle it.
That could be abstracted out in config.hpp, perhaps something like:

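#include <istream>   // std::istream, std::wistream
#include <string>    // std::string, std::wstring

#if defined( _MSC_VER ) && defined( _UNICODE )
   #define ANTLR_UNICODE
#else
   ...  // other compilers that support unicode
#endif

#ifdef ANTLR_UNICODE
   // wide build: typedef <existing type> <new name>
   typedef wchar_t       char_type;
   typedef std::wstring  string_type;
   typedef std::wistream stream_type;
   #define _T( x ) L##x
#else
   // narrow build
   typedef char          char_type;
   typedef std::string   string_type;
   typedef std::istream  stream_type;
   #define _T( x ) x
#endif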

and so on...
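
A minimal sketch of how code might then be written once against those names,
so the same source builds either way.  The ANTLR_T macro, count_char and
make_greeting are hypothetical names used only for this illustration, and
ANTLR_UNICODE is assumed to be set by the build:

```cpp
#include <cstddef>
#include <string>

// Assumption: ANTLR_UNICODE is defined by the build for a wide-character build.
#ifdef ANTLR_UNICODE
   typedef wchar_t      char_type;
   typedef std::wstring string_type;
   #define ANTLR_T( x ) L##x   // hypothetical literal macro
#else
   typedef char         char_type;
   typedef std::string  string_type;
   #define ANTLR_T( x ) x
#endif

// Counts occurrences of c in text; the source is identical in both builds.
std::size_t count_char( const string_type& text, char_type c )
{
    std::size_t n = 0;
    for ( string_type::size_type i = 0; i < text.size(); ++i )
        if ( text[i] == c )
            ++n;
    return n;
}

// Builds a string from literals without naming char or wchar_t anywhere.
string_type make_greeting()
{
    return string_type( ANTLR_T( "hello" ) ) + ANTLR_T( ", world" );
}
```

Nothing outside the #ifdef mentions char, wchar_t, string or wstring directly,
which is the property the generated code would need as well.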

> If you can provide sane patches to do this I'm happy to 
> incorporate them in antlr2.

I'll take a look and see what I can do.  Problem is that I'm using a
customized version of the runtime that inlines a lot of code, for
performance reasons.  Let me see what I can come up with.

Don



