[antlr-interest] Unicode

Ric Klaren ric.klaren at gmail.com
Tue Jan 31 07:35:37 PST 2006


Hi,

On 1/31/06, Don Caton <dcaton at shorelinesoftware.com> wrote:
> > The string/wstring thing can be solved to some extent with
> > the unicode examples approach.
>
> The "Unicode" example isn't a Unicode program though.  There's still hard
> coded std::string, literal ansi strings, etc.  It might lex Unicode input,
> but you still end up with code that is ansi, with UTF8 encoded strings.  A
> reasonable solution in some cases perhaps, but not a true end-to-end Unicode
> system.

Yes, true. It provides an example of how to make an antlr CharScanner
subclass that can deal with a certain kind of Unicode input (UTF-8) and
that repackages the lexed tokens into something the backend can use (in
this case the choice was to re-encode to UTF-8 in std::string's).
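
The core of it boils down to decoding UTF-8 from the input and
re-encoding the token text, roughly like this (a simplified, untested
sketch; the real example hooks this into antlr::CharScanner and the
token objects, and the helper names here are made up):

#include <istream>
#include <string>

// Decode one UTF-8 encoded code point from the input stream.
// Returns -1 on end of input (error handling omitted for brevity).
static int readCodePoint(std::istream& in)
{
   int c = in.get();
   if (c == std::char_traits<char>::eof())
      return -1;
   int extra = 0;
   if      ((c & 0x80) == 0x00)   extra = 0;              // 0xxxxxxx
   else if ((c & 0xE0) == 0xC0) { extra = 1; c &= 0x1F; } // 110xxxxx
   else if ((c & 0xF0) == 0xE0) { extra = 2; c &= 0x0F; } // 1110xxxx
   else if ((c & 0xF8) == 0xF0) { extra = 3; c &= 0x07; } // 11110xxx
   for (int i = 0; i < extra; ++i)
      c = (c << 6) | (in.get() & 0x3F);                   // 10xxxxxx continuations
   return c;
}

// Re-encode a code point as UTF-8 and append it to a token's text.
static void appendUTF8(std::string& text, unsigned int cp)
{
   if (cp < 0x80)
      text += (char)cp;
   else if (cp < 0x800) {
      text += (char)(0xC0 | (cp >> 6));
      text += (char)(0x80 | (cp & 0x3F));
   }
   else if (cp < 0x10000) {
      text += (char)(0xE0 | (cp >> 12));
      text += (char)(0x80 | ((cp >> 6) & 0x3F));
      text += (char)(0x80 | (cp & 0x3F));
   }
   else {
      text += (char)(0xF0 | (cp >> 18));
      text += (char)(0x80 | ((cp >> 12) & 0x3F));
      text += (char)(0x80 | ((cp >> 6) & 0x3F));
      text += (char)(0x80 | (cp & 0x3F));
   }
}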

> For a true Unicode program you need to use wstring, wcout, literal Unicode
> strings, and so on.  All of which can be abstracted with a handful of
> typedefs.  The Unicode example does a little of that by using the
> 'char_type' typedef.  That just needs to be continued throughout all the
> code, including the generated code.

Agreed, the runtime needs work and probably the generated code as well.
The Unicode example is a proof of concept; I lack a real application to
test the idea against before making it a standard feature of antlr2.

> > For general unicode support
> > it's probably easiest (and most portable) to build on top of ICU.
>
> ICU?

This one:

http://www-306.ibm.com/software/globalization/icu/index.jsp

It seems to be one of the most portable ways of supporting Unicode,
although I'm reluctant to tie (part of) the antlr runtime to such a big
library.
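
For what it's worth, with ICU the byte-level juggling above more or
less collapses into a conversion through ICU's UnicodeString; something
along these lines (untested sketch, the function name is made up):

#include <string>
#include <unicode/unistr.h>   // icu::UnicodeString

// Round-trip a UTF-8 byte string through ICU's internal UTF-16 form.
// (Error checking and proper buffer preflighting omitted.)
std::string roundTrip(const std::string& utf8)
{
   // Construct from a byte string, telling ICU which codepage it is in.
   icu::UnicodeString ustr(utf8.c_str(), "UTF-8");

   // Extract back into a UTF-8 buffer; real code would ask ICU for the
   // required size first instead of assuming 1 kB is enough.
   char buf[1024];
   int32_t len = ustr.extract(0, ustr.length(), buf,
                              (uint32_t)sizeof(buf), "UTF-8");
   return std::string(buf, len);
}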

> > There's not really a standard way on how unicode is handled
> > in C++ e.g. dealing with encodings from files on disk. I'm
> > not even sure if it is standardized how a wide string
> > constant is encoded (I thought it was implementation
> > dependent but not 100% sure from the top of my head).
>
> There are encoding prefixes in disk files, which I believe are standardized.

There might be issues when you want to compare an L"sumthin" (from,
say, a literals table) with something you just read from disk: you have
to match the encodings and/or normalize both to a canonical form
(normalization, in Unicode terms). It's been a while since I looked
into this, but there might be a few gotchas (or I'm pessimistic ;) ).
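
If ICU is in the picture anyway, that normalization step would look
something like this (untested sketch, function name made up; NFC is
the composed canonical form):

#include <unicode/unistr.h>    // icu::UnicodeString
#include <unicode/unorm.h>     // UNORM_NFC
#include <unicode/normlzr.h>   // icu::Normalizer

// Compare two strings after normalizing both to NFC, so that e.g. a
// precomposed e-acute and "e" + combining acute compare equal.
bool equalNormalized(const icu::UnicodeString& a,
                     const icu::UnicodeString& b)
{
   UErrorCode status = U_ZERO_ERROR;
   icu::UnicodeString na, nb;
   icu::Normalizer::normalize(a, UNORM_NFC, 0, na, status);
   icu::Normalizer::normalize(b, UNORM_NFC, 0, nb, status);
   return U_SUCCESS(status) && na == nb;
}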

> But that could be left to the end user, you just have to insure that you're
> providing valid unicode input.
>
> All of this could be encapsulated in #defines and typedefs though, and any
> compiler that supports the standard C++ library should be able to handle it.
> That could be abstracted out in config.hpp, perhaps something like:
>
> #if defined( _MSC_VER ) && defined( _UNICODE )
>    #define ANTLR_UNICODE
> #else
>    ...  // other compilers that support unicode
> #endif
>
> #ifdef ANTLR_UNICODE
>    typedef wchar_t       char_type;
>    typedef std::wstring  string_type;
>    typedef std::wistream stream_type;
>    #define _T( x ) L##x
> #else
>    typedef char          char_type;
>    typedef std::string   string_type;
>    typedef std::istream  stream_type;
>    #define _T( x ) x
> #endif
>
> and so on...
>
> > If you can provide sane patches to do this I'm happy to
> > incorporate them in antlr2.
>
> I'll take a look and see what I can do.  Problem is that I'm using a
> customized version of the runtime stuff that inlines a lot of stuff, for
> performance reasons.  Let me see what I can come up with.

I guess we'll see. What platform(s) are you working on? Once you have
something, it might be a good idea to do a sanity check against some
other OS/compiler combinations to see how portable things are.

Cheers,

Ric
