[antlr-interest] Unicode

Don Caton dcaton at shorelinesoftware.com
Mon Jan 30 05:57:49 PST 2006


I know the subject of Unicode comes up now and again, but it seems to me
that the solutions offered so far are hacks at best.

It seems that it would be difficult, if not impossible, to create a Unicode
parser and lexer with Antlr, given the way it currently generates code (I'm
talking specifically about C++ output here).

The C++ code generator insists on using string, rather than a typedef or
#define that could be set to string or wstring.  And throughout the
generated code as well as the static code, char * is assumed rather than
TCHAR *, single-byte string literals are used, and so on.

What's the rationale for this?  Is there something obvious I'm overlooking?
Unicode isn't exactly a new concept.  Why are we limited to the relatively
ancient world of 7- or 8-bit character sets?

It seems to me that a few typedefs or #defines would make creating true
Unicode lexers and parsers a no-brainer and wouldn't break anything for
those who still need ANSI parsers.
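
As a rough sketch of the kind of switch I have in mind (every name here,
LEXER_USE_WIDE_CHARS, lexer_char, lexer_string, LEXER_TEXT and
match_keyword, is hypothetical and not something in the Antlr runtime), the
generated and static code would be written against a typedef and a literal
macro rather than against string and char * directly:

```cpp
#include <string>

// Hypothetical names throughout; none of these exist in the Antlr runtime.
// Defining LEXER_USE_WIDE_CHARS at build time selects the wide variant.
#ifdef LEXER_USE_WIDE_CHARS
typedef wchar_t lexer_char;
#define LEXER_TEXT(s) L##s   // turns "abc" into L"abc"
#else
typedef char lexer_char;
#define LEXER_TEXT(s) s      // leaves narrow literals untouched
#endif

// Resolves to std::string or std::wstring depending on the switch above.
typedef std::basic_string<lexer_char> lexer_string;

// Stand-in for generated code: it only ever mentions the typedef and the
// literal macro, so the same source compiles in either mode.
inline bool match_keyword(const lexer_string& text)
{
    return text == LEXER_TEXT("begin");
}
```

Built without the define, this behaves exactly as a plain string version
would, so existing narrow parsers should be unaffected. Whether wchar_t is
the right wide type on every platform is a separate question; this only
shows the mechanism.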

BTW, I know you can get a true Unicode parser from the C# code generator,
but I need C++.

Don



