[antlr-interest] Unicode handling

Thu Apr 22 01:57:47 PDT 2004

Bulk reply ;)

On Wed, Apr 21, 2004 at 03:08:41PM -0700, Mark Lentczner wrote:
> My project's source files are Unicode, and we are using Antlr to
> generate the lexer, parser and compiler in C++.

Sorry, but you're on your own there. Unicode is not supported in C++: there
are too many hacks/bad choices in the generated code and support lib that may
bite you. YMMV. Keep the phone number of your shrink at hand for support.

(Just to make clear you're preparing for headaches)

> Seems from the doc that Antlr isn't really ready to deal with the full
> complement of Unicode characters.  I found references to problems with
> EOF (integer -1, typecast to 0xFFFF as a character), problems with
> character sets (getting very large), and it seems that it assumes that
> Unicode characters are only 16 bits (which is no longer true.)

Well, the C++ stuff truncates to 8 bits whenever it sees fit. Nothing is 16 bit
there, except for an unlucky sign extension here and there due to people
using ints where unsigneds should be used.
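
To illustrate the sign-extension trap, here is a minimal C++ sketch. It is not taken from the ANTLR support lib; the names widenNaive, widenSafe and kEof are made up for this example. It shows why a byte should be widened through unsigned char before it is compared against an EOF value of -1.

```cpp
// Hypothetical helpers for illustration; not part of the ANTLR support lib.

// EOF sentinel as an int, the in-band value the thread complains about.
const int kEof = -1;

// Naive widening: where plain char is signed, bytes >= 0x80 come out
// negative, and the byte 0xFF becomes -1, indistinguishable from kEof.
inline int widenNaive(char c) {
    return c;
}

// Safer widening: convert to unsigned char first, so every byte maps to
// 0..255 and can never collide with kEof.
inline int widenSafe(char c) {
    return static_cast<unsigned char>(c);
}
```

Whether widenNaive misbehaves depends on the signedness of plain char on the platform, which is why the problem only shows up "if you're unlucky". widenSafe gives the same result either way.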

> So, rather than try to work around or fix these problems, I intend to
> make my tool chain work with UTF-8 encoded source.  (This is especially
> easy for us, since the process feeding the source stream already
> normalizes the incoming character set to UTF-8.)

This is an approach you might get away with though. But things might bite
you here and there.

> We'd be parsing the UTF-8 encoded version of these characters:
>
> NAME_START_CHAR:
>      ':' | 'A'..'Z' | '_' | 'a'..'z'
>      | '\u00C3' '\u0080'..'\u0096'           // '\u00C0'-'\u00D6'
>      | '\u00C3' '\u0098'..'\u00B6'           // '\u00D8'-'\u00F6'
>      | '\u00C3' '\u00B8'..'\u00BF'           // '\u00F8'-'\u00FF'
>      | '\u00C4'..'\u00CB' '\u0080'..'\u00BF' // '\u0100'-'\u02FF'
>      | '\u00CD' '\u00B0'..'\u00BD'           // '\u0370'-'\u037D'
>      | '\u00CD' '\u00BF'                     // '\u037F'
>      | '\u00CE'..'\u00DF' '\u0080'..'\u00BF' // '\u0380'-'\u07FF'
>      | '\u00E0' '\u00A0'..'\u00BF' '\u0080'..'\u00BF'    // '\u0800'-'\u0FFF'
>      | '\u00E1' '\u0080'..'\u00BF' '\u0080'..'\u00BF'    // '\u1000'-'\u1FFF'
>      ... and so on ...
>      ;

Note that 2.7.4 will barf on attempts at 16-bit char constants. 2.7.3 will
at times do a sign extension if you're unlucky (but it should be safe most of
the time, or there are alternative constructions available).
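
The byte values in the quoted rule can be checked by encoding the range endpoints by hand. Below is a minimal sketch of a UTF-8 encoder, assuming a hypothetical helper named encodeUtf8 (not an ANTLR API). For brevity it does not reject surrogate code points.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point (up to U+10FFFF)
// as a UTF-8 byte string. Surrogates (U+D800..U+DFFF) are not rejected.
std::string encodeUtf8(unsigned long cp) {
    assert(cp <= 0x10FFFFUL);
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, U+00C0 comes out as the bytes C3 80 and U+0800 as E0 A0 80, which is where the lead bytes and continuation ranges in the rule come from.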

> Does anyone see any pitfalls to this other than increasing the look
> ahead for the lexer?

See first part of reply. Keep a debugger at hand + the antlr source + your
favourite text editor.

> Since in our source language, all the meaningful punctuation is in the
> visible US-ASCII range, the only place the difference between parsing
> Unicode characters vs. UTF-8 encoded Unicode characters would be in things
> like the NAME token production.

If you trick antlr into making the right bitsets, you may get away with
handcoding/modifying the few rules that need to deal with UTF-8 multibyte
sequences. The moment you put the icky bits in nice 8-bit strings you're
basically home free, except for sorting out the actual lengths of the text
etc. You could get away with redefining the strings in antlr to wchars and
recompiling a hacked version of the support lib to have a bit more 'room'
to maneuver. That has been done before with some luck.
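
As a rough idea of what the handcoded bits and the length bookkeeping might look like, here is a small sketch. The helper names utf8SequenceLength and utf8CodePointCount are invented for this example, and the counting function assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence introduced by
// lead byte b, or 0 if b cannot start a sequence.
inline std::size_t utf8SequenceLength(unsigned char b) {
    if (b < 0x80) return 1;                // 0xxxxxxx: plain ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2;  // 110xxxxx (C0 and C1 are overlong)
    if (b >= 0xE0 && b <= 0xEF) return 3;  // 1110xxxx
    if (b >= 0xF0 && b <= 0xF4) return 4;  // 11110xxx, up to U+10FFFF
    return 0;                              // continuation byte or invalid lead
}

// Hypothetical helper: count code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
inline std::size_t utf8CodePointCount(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The point is that the byte length of a token's text (what an 8-bit string reports) and its length in characters differ as soon as a multibyte sequence appears, so column counts and the like need the second function rather than size().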

> This seems much more preferable to me than extending the C++ support
> with some Unicode library (like IBM's icu or some such).

I commend you if you do it with the current support lib (in both cases ;) )

On Wed, Apr 21, 2004 at 04:31:14PM -0700, John D. Mitchell wrote:
> >>>>> "Mark" == Mark Lentczner <markl at glyphic.com> writes:
> > This seems much more preferable to me than extending the C++ support with
> > some Unicode library (like IBM's icu or some such).
>
> I concur.

A Unicode library might be preferable over reinventing the wheel though. And
for me it would be a lot quicker to implement (unless there are volunteers
out there?).

> In fact, I almost took that same approach but I was able to dodge the
> Unicode bullet completely. :-)

Lucky lucky ;)

> For Antlr v3, aside from my perennial haranguing for complete and proper
> hoisting support, I really want to get rid of all of this ridiculous use of
> in-band signalling.  Please join me in pestering Ter about this. :-)

Erm, this means I'm added to your pester list <shudders-in-horror>. Shoot, it
was so convenient to have Ter take the heat all the time....

On Wed, Apr 21, 2004 at 05:36:23PM -0700, Terence Parr wrote:
> UNICODE will work well.

If you're lucky.

> Note that 2.7.3 should do pretty well at
> UNICODE.  Give it a shot :)  \uFFFE is the max valid unicode right?  -1
> shouldn't be a problem anymore.

Because:

On Wed, Apr 21, 2004 at 08:30:56PM -0700, Mark Lentczner wrote:
> No, it is not.  U+10FFFF is (Since Unicode 3.1).  Yup, 21 bits.

And I also wonder what you'll get if you feed the lexer in Java mode a
sequence that contains such a value broken up over two UTF-16 values (which
in lexer terms should be treated as one!). So I don't think even Java mode
is safe from headaches. The lexer has to be Unicode aware (in a sense this
might be nice, since there's then a chance of sharing more code between the
different target languages ;) )
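
For reference, a minimal sketch of how two UTF-16 values (a surrogate pair) map to the single code point the lexer ought to see. The type and function names are made up for this example; nothing here is ANTLR code.

```cpp
#include <cassert>

// Hypothetical types: a code point needs 21 bits, a UTF-16 unit needs 16.
typedef unsigned long CodePoint;
typedef unsigned short Utf16Unit;

// High (leading) surrogates occupy U+D800..U+DBFF.
inline bool isHighSurrogate(Utf16Unit u) { return u >= 0xD800 && u <= 0xDBFF; }

// Low (trailing) surrogates occupy U+DC00..U+DFFF.
inline bool isLowSurrogate(Utf16Unit u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid surrogate pair into one code point in U+10000..U+10FFFF.
inline CodePoint combineSurrogates(Utf16Unit hi, Utf16Unit lo) {
    assert(isHighSurrogate(hi) && isLowSurrogate(lo));
    return 0x10000UL
         + ((static_cast<CodePoint>(hi) - 0xD800UL) << 10)
         + (static_cast<CodePoint>(lo) - 0xDC00UL);
}
```

A lexer that compares individual 16-bit values against character ranges sees the two halves separately, so a range check against anything above U+FFFF can never match unless the pair is combined first.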

On Thu, Apr 22, 2004 at 02:46:41AM +0200, Sebastian Kaliszewski wrote:
> On Thu, 22 April 2004 02:36, Terence Parr wrote:
> > UNICODE will work well.  Note that 2.7.3 should do pretty well at
> > UNICODE.  Give it a shot :)  \uFFFE is the max valid unicode right?  -1
> > shouldn't be a problem anymore.
>
> Also in C++ mode?

If you're not feeding it complete Unicode input ;) and restrict it to plain
ASCII kind of stuff, you should have no problem ;) In general: one
big *no*.

On Wed, Apr 21, 2004 at 08:30:56PM -0700, Mark Lentczner wrote:
> Note that Java is broken in this regard.  See
> http://weblogs.java.net/pub/wlg/1202 for a discussion.  I understand
> that some XML tools in Java go to great lengths to get around the
> problem.
>
> > Oh.  I think C++ doesn't handle UNICODE yet, but I'll let Ric answer
> > this ;)
> And indeed, I'm generating C++...

Which means you're out of luck, sadly.

Options:
- Hack around things and create a maintenance nightmare (most probable).
- Find (or handcode) a different lexer, though you'll probably have to tinker
  a bit with the support lib.
- Don't use ANTLR.

Cheers,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893722  ----
-----+++++*****************************************************+++++++++-------
  Chaos often breeds life, when order breeds habit.
  --- Henry B. Adams, The Education of Henry Adams
