[antlr-interest] Unicode
Don Caton
dcaton at shorelinesoftware.com
Tue Jan 31 06:53:55 PST 2006
> Hi Don,
> The string/wstring thing can be solved to some extent with
> the unicode examples approach.
The "Unicode" example isn't a Unicode program though. There's still hard-coded
std::string, literal ANSI strings, etc. It might lex Unicode input,
but you still end up with code that is ANSI, with UTF-8 encoded strings. A
reasonable solution in some cases perhaps, but not a true end-to-end Unicode
system.
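To illustrate the difference, here is a minimal sketch (not taken from the ANTLR example; the function names are made up for illustration). It shows that a std::string holding UTF-8 counts bytes, while a std::wstring counts wide characters, so any code that indexes or measures the narrow string has to know about the encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 byte string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Hypothetical demo: the same four-character word in both representations.
void utf8_vs_wide_demo()
{
    const std::string  narrow = "caf\xC3\xA9";   // "café" as UTF-8 bytes
    const std::wstring wide   = L"caf\u00E9";    // "café" as wide characters

    assert(narrow.size() == 5);            // bytes, not characters
    assert(utf8_code_points(narrow) == 4); // needs encoding-aware counting
    assert(wide.size() == 4);              // one wchar_t per character here
}
```

Note that wchar_t is 16 bits on some platforms, so characters outside the Basic Multilingual Plane would still take two units there.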
For a true Unicode program you need to use wstring, wcout, literal Unicode
strings, and so on. All of which can be abstracted with a handful of
typedefs. The Unicode example does a little of that by using the
'char_type' typedef. That just needs to be continued throughout all the
code, including the generated code.
> For general unicode support
> it's probably easiest (and most portable) to build on top of ICU.
ICU?
> I don't think any rationale, it's old code ;) Although the
> standard tools/libs provided for C++ to do unicode-ish things
> seem to be not very standard across various compiler
> implementations, so that may have been a reason not to
> bother. I came to be C++ maintainer at a later point in time
> so no idea.
Understood.
> There's not really a standard way on how unicode is handled
> in C++ e.g. dealing with encodings from files on disk. I'm
> not even sure if it is standardized how a wide string
> constant is encoded (I thought it was implementation
> dependent but not 100% sure from the top of my head).
There are encoding prefixes (byte order marks) at the start of disk files,
which I believe are standardized. But that could be left to the end user; you
just have to ensure that you're providing valid Unicode input.
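For what it's worth, checking for those prefixes is only a few lines. The following is a rough sketch, assuming the caller has already read the first few bytes of the file into a std::string; the enum and function names are hypothetical and not part of ANTLR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for the sketch below.
enum Encoding {
    ENC_UNKNOWN, ENC_UTF8, ENC_UTF16_BE, ENC_UTF16_LE, ENC_UTF32_BE, ENC_UTF32_LE
};

// Inspect the leading bytes of a file for a Unicode byte order mark.
// The UTF-32 checks come first because the UTF-32 LE mark (FF FE 00 00)
// begins with the UTF-16 LE mark (FF FE). A UTF-16 LE file whose first
// character is U+0000 would therefore be misreported as UTF-32 LE.
Encoding detect_bom(const std::string& head)
{
    const std::size_t n = head.size();
    const unsigned char* b =
        reinterpret_cast<const unsigned char*>(head.data());

    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return ENC_UTF32_LE;
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return ENC_UTF32_BE;
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return ENC_UTF8;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return ENC_UTF16_BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return ENC_UTF16_LE;
    return ENC_UNKNOWN;   // no mark; the caller has to decide
}

// Hypothetical self-check covering the two most common cases.
void detect_bom_self_check()
{
    assert(detect_bom(std::string("\xEF\xBB\xBF" "x")) == ENC_UTF8);
    assert(detect_bom(std::string("no mark")) == ENC_UNKNOWN);
}
```

A file without a mark is not necessarily invalid, so ENC_UNKNOWN only means the encoding has to come from somewhere else.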
All of this could be encapsulated in #defines and typedefs though, and any
compiler that supports the standard C++ library should be able to handle it.
That could be abstracted out in config.hpp, perhaps something like:
#if defined( _MSC_VER ) && defined( _UNICODE )
#define ANTLR_UNICODE
#else
... // other compilers that support unicode
#endif
#ifdef ANTLR_UNICODE
typedef wchar_t       char_type;
typedef std::wstring  string_type;
typedef std::wistream stream_type;
#define _T( x ) L##x
#else
typedef char          char_type;
typedef std::string   string_type;
typedef std::istream  stream_type;
#define _T( x ) x
#endif
and so on...
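As a sketch of how client code might then look, assuming a switch named ANTLR_UNICODE as above (hypothetical, not something the ANTLR runtime defines today): the same function compiles for either character width. The literal macro is called ANTLR_T here because names like _T are reserved for the implementation and are already defined by MSVC's <tchar.h>; count_char is just an illustrative helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical switch; uncomment (or define on the command line) for a
// wide-character build.
// #define ANTLR_UNICODE

#ifdef ANTLR_UNICODE
typedef wchar_t      char_type;
typedef std::wstring string_type;
#define ANTLR_T( x ) L##x
#else
typedef char         char_type;
typedef std::string  string_type;
#define ANTLR_T( x ) x
#endif

// Written once against the typedefs; works for narrow and wide builds alike.
std::size_t count_char(const string_type& s, char_type ch)
{
    std::size_t n = 0;
    for (string_type::size_type i = 0; i < s.size(); ++i)
        if (s[i] == ch)
            ++n;
    return n;
}

// Hypothetical demo: literals go through ANTLR_T so they match char_type.
void typedef_demo()
{
    const string_type expr = ANTLR_T("a+b+c");
    assert(count_char(expr, ANTLR_T('+')) == 2);
}
```

The point is that nothing in count_char or its caller mentions char, wchar_t, or a bare string literal, which is what the generated code would also need to do.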
> If you can provide sane patches to do this I'm happy to
> incorporate them in antlr2.
I'll take a look and see what I can do. Problem is that I'm using a
customized version of the runtime that inlines a lot of code, for
performance reasons. Let me see what I can come up with.
Don