[antlr-interest] Unicode escapes in C++

Ric Klaren ric.klaren at gmail.com
Tue Nov 7 12:06:38 PST 2006


Hi,

Kochismo wrote:
> I'm interested in parsing a plain ASCII file which represents Unicode
> characters as escaped hex digits.  For example:
> 
> blah\uff20\uff30blah
> 
> is the string blah, Unicode character U+FF20, Unicode character U+FF30,
> then blah.  Recognising it with the lexer is simple enough, but the lexer
> returns tokens as C++ strings, rather than Unicode-friendly wstrings.  Is
> there a way I can handle this from within the lexer?  Or will I have to
> write code to convert the string token into a wstring?

You can probably get some inspiration for this from the Unicode C++
example in the distribution. You probably only need to pay attention to
the part where the strings for the tokens are collected.
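
If you end up converting the token text after lexing instead, a minimal
sketch could look like the following. It is only an illustration, not
code from the ANTLR distribution: the function name
decode_unicode_escapes is made up, it assumes every escape is exactly
\u followed by four hex digits, and it assumes wchar_t can hold any
value up to 0xFFFF. Anything that does not match that pattern is copied
through unchanged.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of ANTLR): expands \uXXXX escapes found in
// an ASCII token text into wide characters and returns a std::wstring.
std::wstring decode_unicode_escapes(const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        // An escape needs six characters: backslash, 'u', four hex digits.
        bool is_escape = i + 6 <= in.size() && in[i] == '\\' && in[i + 1] == 'u';
        for (std::size_t k = 2; is_escape && k < 6; ++k)
            is_escape = std::isxdigit(static_cast<unsigned char>(in[i + k])) != 0;

        if (is_escape) {
            // Parse the four hex digits as one code unit.
            unsigned long code = std::strtoul(in.substr(i + 2, 4).c_str(), 0, 16);
            out.push_back(static_cast<wchar_t>(code));
            i += 6;
        } else {
            // Ordinary (or malformed) input: widen the byte as-is.
            out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(in[i])));
            ++i;
        }
    }
    return out;
}
```

Called on the text blah\uff20\uff30blah this should give a wstring of
ten wide characters, with U+FF20 and U+FF30 in the middle. Surrogate
pairs are not combined; each \uXXXX becomes one wchar_t.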

Cheers,

Ric

