[antlr-interest] Unicode
Don Caton
dcaton at shorelinesoftware.com
Mon Jan 30 05:57:49 PST 2006
I know the subject of Unicode comes up now and again, but it seems to me
that the offered solutions are hacks at best.
It seems that it would be difficult, if not impossible, to create a Unicode
parser and lexer from Antlr, given the way it currently generates code (I'm
talking specifically about C++ output here).
The C++ code generator insists on using string, rather than a typedef or
#define that could be set to string or wstring. And throughout the
generated code as well as the static code, char * is assumed rather than
TCHAR *, single-byte string literals are used, and so on.
What's the rationale for this? Is there something obvious I'm overlooking?
Unicode isn't exactly a new concept. Why are we limited to the relatively
ancient world of 7- or 8-bit character sets?
It seems to me that a few typedefs or #defines would make creating true
Unicode lexers and parsers a no-brainer and wouldn't break anything for
those who still need ANSI parsers.
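As a rough sketch of the kind of typedef switch described above: the names
ANTLR_USE_WCHAR, antlr_char, antlr_string, ANTLR_T and isKeyword are all
hypothetical and are not part of Antlr. The idea is only that code written
against the typedefs compiles unchanged for either character width.

```cpp
#include <string>

// Hypothetical build switch (not an Antlr option): define ANTLR_USE_WCHAR
// to compile everything below with wide characters.
#ifdef ANTLR_USE_WCHAR
typedef wchar_t antlr_char;
#define ANTLR_T(s) L##s   // turns "abc" into L"abc"
#else
typedef char antlr_char;
#define ANTLR_T(s) s      // leaves narrow literals untouched
#endif

// One string type that follows the selected character type.
typedef std::basic_string<antlr_char> antlr_string;

// Example of code written once against the typedefs; it never names
// char, wchar_t, string or wstring directly.
inline bool isKeyword(const antlr_string& text) {
    return text == ANTLR_T("begin") || text == ANTLR_T("end");
}
```

With the switch undefined this behaves exactly like today's char-based code,
so existing narrow-character parsers would not be affected.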
BTW, I know you can get a true Unicode parser from the C# code generator,
but I need C++.
Don