[antlr-interest] Unicode

Mon Jan 30 06:49:13 PST 2006

Hi Don,

On 1/30/06, Don Caton <dcaton at shorelinesoftware.com> wrote:
> I know the subject of Unicode comes up now and again, but it seems to me
> that at best, the offered solutions are a hack.

I've heard of a few cases of at least some success with the approach
from the unicode example. (There is still some trouble with string
literal testing.)

> It seems that it would be difficult, if not impossible to create a Unicode
> parser and lexer from Antlr, given the way it currently generates code (I'm
> talking specifically about C++ output here).
>
> The C++ code generator insists on using string, rather than a typedef or
> #define that could be set to string or wstring.  And throughout the
> generated code as well as the static code, char * is assumed rather than
> TCHAR *, single byte literal strings are used, etc. etc.

The string/wstring thing can be solved to some extent with the unicode
example's approach. For general unicode support it's probably easiest
(and most portable) to build on top of ICU.
When working on the unicode example I tried to fix some issues with
the code generator. If more is needed I can give a hand (if time
permits).

> What's the rationale for this?

I don't think there is any rationale, it's old code ;) That said, the
standard tools/libs provided for C++ to do unicode-ish things are not
very standard across the various compiler implementations, so that may
have been a reason not to bother. I came to be C++ maintainer at a
later point in time, so I have no idea.

> Is there something obvious I'm overlooking?

I'm not sure if you've seen the latest unicode example. It provides a
framework (although you need to fill in some blanks) for getting a
unicode stream read through the lexer and packaged up again for the
parser. I did not investigate how far it can be made to play nicely
with the AST stuff. There's not really a standard way of handling
unicode in C++, e.g. dealing with the encodings of files on disk. I'm
not even sure if it is standardized how a wide string constant is
encoded (I thought it was implementation dependent, but I'm not 100%
sure off the top of my head).
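
On the wide string constant question, a small probe can at least show
what a particular compiler does. This is only a sketch and not part of
the unicode example; the function names are made up for illustration.
As far as I know the width of wchar_t is implementation-defined, so the
same literal may come out as one 32-bit unit on one compiler and as a
surrogate pair of two 16-bit units on another.

```cpp
#include <climits>
#include <cstddef>
#include <ostream>

// Width of wchar_t in bits on this compiler (commonly 16 or 32).
inline std::size_t wide_char_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Number of wchar_t units the compiler emitted for a single code point
// outside the Basic Multilingual Plane (U+1D11E, a musical symbol).
// One unit suggests a 32-bit encoding, two units suggest UTF-16.
inline std::size_t units_for_non_bmp() {
    static const wchar_t clef[] = L"\U0001D11E";
    return sizeof(clef) / sizeof(clef[0]) - 1;  // drop the terminator
}

// Hypothetical helper: print both numbers to the given stream.
inline void report_wide_encoding(std::ostream& out) {
    out << "wchar_t bits: " << wide_char_bits()
        << ", units for U+1D11E: " << units_for_non_bmp() << '\n';
}
```

Calling report_wide_encoding(std::cout) on each target compiler shows
whether code that indexes a wide string one character at a time can be
trusted there.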

> Unicode isn't exactly a new concept.  Why are we limited to the relatively
> ancient world of 7 or 8 bit character sets?

I hope to do better for the support lib for antlr3 (through the use of
templates it will be easier to plug stuff in but I'll probably
standardize on ICU (next to old school 8-bit)).

> It seems to me that a few typedefs or #defines would make creating true
> Unicode lexers and parsers a no-brainer and wouldn't break anything for
> those who still need ansi parsers.

If you can provide sane patches to do this I'm happy to incorporate
them in antlr2.
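
To make the idea concrete, here is a minimal sketch of the kind of
switch Don describes. None of these names exist in antlr2 today:
SKETCH_USE_WIDE_CHARS, sketch_char, sketch_string and SKETCH_TEXT are
all hypothetical, and a real patch would have to touch both the code
generator and the support library.

```cpp
#include <cassert>
#include <string>

// Hypothetical build switch; define it to get wide characters.
#ifdef SKETCH_USE_WIDE_CHARS
typedef wchar_t sketch_char;
#define SKETCH_TEXT(s) L##s   // turns "abc" into L"abc"
#else
typedef char sketch_char;
#define SKETCH_TEXT(s) s      // leaves narrow literals alone
#endif

// One string type that follows the character type chosen above.
typedef std::basic_string<sketch_char> sketch_string;

// Stand-in for a generated literal test: the same source compiles in
// both modes because the literal goes through SKETCH_TEXT.
inline bool is_keyword_begin(const sketch_string& text) {
    return text == SKETCH_TEXT("begin");
}

// Small usage check, valid in either mode.
inline void sketch_usage() {
    assert(is_keyword_begin(SKETCH_TEXT("begin")));
    assert(!is_keyword_begin(SKETCH_TEXT("end")));
}
```

Generated code would then refer only to sketch_string and SKETCH_TEXT,
so existing 8-bit parsers keep building unchanged when the switch is
left undefined.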

Cheers,

Ric