Fwd: [antlr-interest] ANTLR, C++ and UNICODE

Ric Klaren ric.klaren at gmail.com
Tue Jan 25 09:45:02 PST 2005


CC'ing back to the list since this reply has loads of relevant stuff
for the list as well.

On Tue, 25 Jan 2005 07:56:47 -0800 (PST), Peggy Fieland
<madcapmaggie at yahoo.com> wrote:
> I need unicode support (UTF 8 will do).
...<snip>...
>

Have a look at the prerelease 2.7.5 snapshot; it provides a rudimentary
way to approach things with a custom charscanner/inputbuffer. I have
no real-life parsing problem/data that I can use to really test it, so
it is still in the 'this seems to work' status. The parts of
interest are in the examples/cpp/unicode example. Since the release of
the devel snapshot in September I have had no feedback on this stuff.

Currently the example decodes UTF8 in the UnicodeCharBuffer class to
32bit values that are used inside the UnicodeCharScanner. After a
token is recognized, the values are re-encoded in UTF8 and stored in a
std::string, just as a proof of concept.
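
To make the decode/re-encode round trip concrete, here is a minimal
standalone sketch. It is not the code from examples/cpp/unicode; the
function names decodeUtf8 and appendUtf8 are hypothetical and only
illustrate the idea of turning UTF8 bytes into 32-bit values and back.
It assumes well-formed input apart from the basic checks shown, and it
does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into 32-bit code points.
// Throws std::runtime_error on a bad lead byte, a bad continuation byte,
// or a sequence that is cut off at the end of the input.
std::vector<std::uint32_t> decodeUtf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp = 0;
        int extra = 0;  // number of continuation bytes still expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        for (; extra > 0; --extra) {
            if (i >= bytes.size())
                throw std::runtime_error("truncated UTF-8 sequence");
            const unsigned char c = static_cast<unsigned char>(bytes[i++]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // each continuation adds 6 bits
        }
        out.push_back(cp);
    }
    return out;
}

// Hypothetical helper: append one code point to a std::string as UTF-8.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::runtime_error("code point out of Unicode range");
    }
}
```

A scanner working along these lines would match on the 32-bit values
and call something like appendUtf8 for each value of a recognized token
to build the token text.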

Things to be done are:
- adding more decoding schemes to the CharBuffer (and probably making
it efficient/autosensing/more extensible). Also, what is practical
on this side of things?
- find out what the practical ways are to store the recognized token
text. std::string is probably not the best (duh). I've heard many bad
comments with respect to wchar_t/wstring, and I got the impression not many
people use it (?). There are of course the IBM ICU libraries, but I
would not like adding those as a standard dependency. For this I need
feedback from people actually using unicode. I only meddle with it
since people seem to want it. I have no experience with it other than
reading some specs/whitepapers.
- I have not yet tried what it does with character literals/literals
table checks. Might need codegen tweaks?
- The error reporting probably needs an upgrade/change for unicode,
yet this depends on the way unicode strings are stored in tokens.
Probably also needs codegen tweaks and/or an extra grammar option for
C++. Might be an idea to take along internationalization changes as
well.
- probably a better encoding of bitsets is necessary for unicode, and
ANTLR needs to generate them faster. It works now, but it isn't fast
and it's bigger than necessary.
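
On the bitset point: a dense bitset covering all of Unicode needs
0x110000 bits per set. One possible alternative, sketched below purely
as an illustration, is a sorted list of inclusive code point ranges
searched with a binary search. The names CodePointRange and RangeSet
are hypothetical; this is not what ANTLR generates and not a proposal
for the actual codegen.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical type: an inclusive range of code points [first, last].
struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;
};

// Hypothetical sparse character set. The ranges passed in are assumed
// to be sorted by 'first' and non-overlapping; this is not checked.
class RangeSet {
public:
    explicit RangeSet(std::vector<CodePointRange> ranges)
        : ranges_(std::move(ranges)) {}

    // True if cp falls inside one of the stored ranges.
    bool contains(std::uint32_t cp) const
    {
        // Find the first range that starts after cp ...
        auto it = std::upper_bound(
            ranges_.begin(), ranges_.end(), cp,
            [](std::uint32_t value, const CodePointRange& r) {
                return value < r.first;
            });
        if (it == ranges_.begin())
            return false;  // cp is below every range
        --it;              // ... so the one before it is the only candidate
        return cp <= it->last;
    }

private:
    std::vector<CodePointRange> ranges_;
};
```

Lookup cost grows with the logarithm of the number of ranges rather
than being constant, so whether this is a win over a bitset depends on
how fragmented the sets in a real grammar are.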

I have no personal 'itch' to fix unicode other than that some people in the
community would like it. Practical feedback/testing with respect to
the prototype stuff would get unicode support for C++ done sooner.

> When is 2.7.6 due to come out?

When there are enough interesting changes to warrant a release (usable
C++ unicode would probably be a reason).

Cheers,

Ric

