[antlr-interest] Unicode handling
Ric Klaren
klaren at cs.utwente.nl
Thu Apr 22 01:57:47 PDT 2004
Bulk reply ;)
On Wed, Apr 21, 2004 at 03:08:41PM -0700, Mark Lentczner wrote:
> My project's source files are Unicode, and we are using Antlr to
> generate the lexer, parser and compiler in C++.
Sorry, but you're on your own there. Unicode is not supported in C++ mode:
too many hacks/bad choices in the generated code and support lib that may
bite you. YMMV. Keep your shrink's phone number at hand for support.
(Just to make clear that you're preparing for headaches.)
> Seems from the doc that Antlr isn't really ready to deal with the full
> complement of Unicode characters. I found references to problems with
> EOF (integer -1, typecast to 0xFFFF as a character), problems with
> character sets (getting very large), and it seems that it assumes that
> Unicode characters are only 16 bits (which is no longer true.)
Well, the C++ stuff truncates to 8 bits whenever it sees fit. Nothing is 16
bit there, except for an unlucky sign extension here and there due to people
using ints where unsigneds should be used.
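To make the sign extension problem concrete, here is a minimal sketch. The helper names are made up for illustration and are not part of the ANTLR support lib. It shows what happens to a UTF-8 lead byte such as 0xC3 when it is widened through a signed 8 bit type instead of an unsigned one.

```cpp
#include <cstdint>

// Hypothetical helpers, not ANTLR code.

// Widening through a signed 8 bit type sign-extends the byte:
// 0xC3 (195) turns into -61 and no longer compares equal to 0xC3.
// (The narrowing cast is implementation-defined before C++20, but
// wraps modulo 256 on all mainstream compilers.)
inline int widen_signed(std::uint8_t byte) {
    return static_cast<std::int8_t>(byte);
}

// Widening through an unsigned type keeps the value in 0..255.
inline int widen_unsigned(std::uint8_t byte) {
    return static_cast<int>(byte);
}
```

Plain ASCII bytes (below 0x80) come out the same either way, which is why this kind of bug tends to stay hidden until non-ASCII input shows up.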
> So, rather than try to work around or fix these problems, I intend to
> make my tool chain work with UTF-8 encoded source. (This is especially
> easy for us, since the process feeding the source stream already
> normalizes the incoming character set to UTF-8.)
This is an approach you might get away with though. But things might bite
you here and there.
> We'd be parsing the UTF-8 encoded version of these characters:
>
> NAME_START_CHAR:
> ':' | 'A'..'Z' | '_' | 'a'..'z'
> | '\u00C3' '\u0080'..'\u0096' // '\u00C0'-'\u00D6'
> | '\u00C3' '\u0098'..'\u00B6' // '\u00D8'-'\u00F6'
> | '\u00C3' '\u00B8'..'\u00BF' // '\u00F8'-'\u00FF'
> | '\u00C4'..'\u00CB' '\u0080'..'\u00BF' // '\u0100'-'\u02FF'
> | '\u00CD' '\u00B0'..'\u00BD' // '\u0370'-'\u037D'
> | '\u00CD' '\u00BF' // '\u037F'
> | '\u00CE'..'\u00DF' '\u0080'..'\u00BF' // '\u0380'-'\u07FF'
> | '\u00E0' '\u00A0'..'\u00BF' '\u0080'..'\u00BF' // '\u0800'-'\u0FFF'
> | '\u00E1' '\u0080'..'\u00BF' '\u0080'..'\u00BF' // '\u1000'-'\u1FFF'
> ... and so on ...
> ;
Note that 2.7.4 will barf on attempts at 16 bit char constants. 2.7.3 will
at times do a sign extension if you're unlucky (but it should be safe most
of the time, and alternative constructions are available otherwise).
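For reference, the byte ranges in the quoted rule follow directly from the UTF-8 encoding rules. A small sketch of an encoder (the function name encode_utf8 is mine, not something from ANTLR) that can be used to double check such a table by hand:

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+00C0 encodes as C3 80 and U+0370 as CD B0, which matches the first bytes of the corresponding alternatives in the rule.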
> Does anyone see any pitfalls to this other than increasing the look
> ahead for the lexer?
See first part of reply. Keep a debugger at hand + the antlr source + your
favourite text editor.
> Since in our source language, all the meaningful punctuation is in the
> visible US-ASCII range, the only place the difference between parsing
> Unicode characters vs. UTF-8 encoded Unicode characters would be in things
> like the NAME token production.
If you trick antlr into making the right bitsets, you may get away with
handcoding/modifying the few rules that need to deal with UTF-8 multibyte
sequences. The moment you put the icky bits in nice 8 bit strings you're
basically home free, except for sorting out the actual lengths of the text
etc. You could also get away with redefining the strings in antlr to wchars
and recompiling a hacked version of the support lib to have a bit more
'room' to maneuver. That has been done before with some luck.
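A rough sketch of what such a handcoded piece could look like, independent of the ANTLR support lib. All names here (Decoded, decode_utf8, count_code_points, is_name_start_char) are hypothetical. The name check only covers the ranges quoted above, up to U+1FFF; the part elided with "... and so on ..." is left out on purpose.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

struct Decoded {
    bool ok;            // false on malformed or truncated input
    std::uint32_t cp;   // decoded code point when ok
    std::size_t len;    // number of bytes consumed when ok
};

// Decode the UTF-8 sequence starting at s[pos].
inline Decoded decode_utf8(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return {false, 0, 0};
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);  // unsigned: no sign extension
    std::size_t len;
    std::uint32_t cp;
    if (b0 < 0x80)                     { len = 1; cp = b0; }
    else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0)      { len = 3; cp = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
    else return {false, 0, 0};         // stray continuation byte or invalid lead byte
    if (pos + len > s.size()) return {false, 0, 0};
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return {false, 0, 0};
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates and values above U+10FFFF.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return {false, 0, 0};
    return {true, cp, len};
}

// Length of the text in code points rather than bytes.
// Returns false if s is not valid UTF-8.
inline bool count_code_points(const std::string& s, std::size_t& count) {
    count = 0;
    for (std::size_t pos = 0; pos < s.size(); ) {
        const Decoded d = decode_utf8(s, pos);
        if (!d.ok) return false;
        pos += d.len;
        ++count;
    }
    return true;
}

// Partial NAME_START_CHAR test: only the ranges quoted in this thread.
inline bool is_name_start_char(std::uint32_t cp) {
    return cp == 0x3A || cp == 0x5F ||                       // ':' and '_'
           (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A) ||
           (cp >= 0xC0 && cp <= 0xD6) || (cp >= 0xD8 && cp <= 0xF6) ||
           (cp >= 0xF8 && cp <= 0x2FF) || (cp >= 0x370 && cp <= 0x37D) ||
           (cp >= 0x37F && cp <= 0x1FFF);
}
```

Decoding first and testing the code point afterwards keeps the character class readable, at the price of bypassing the generated bitsets for those rules.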
> This seems much more preferable to me than extending the C++ support
> with some Unicode library (like IBM's icu or some such).
I commend you if you do it with the current support lib (in both cases ;) )
On Wed, Apr 21, 2004 at 04:31:14PM -0700, John D. Mitchell wrote:
> >>>>> "Mark" == Mark Lentczner <markl at glyphic.com> writes:
> > This seems much more preferable to me than extending the C++ support with
> > some Unicode library (like IBM's icu or some such).
>
> I concur.
It might be preferable to reinventing the wheel though. And for me it would
be a lot quicker to implement (unless there are volunteers out there?).
> In fact, I almost took that same approach but I was able to dodge the
> Unicode bullet completely. :-)
Lucky lucky ;)
> For Antlr v3, aside from my perennial haranguing for complete and proper
> hoisting support, I really want to get rid of all of this ridiculous use of
> in-band signalling. Please join me in pestering Ter about this. :-)
Erm, this means I'm added to your pester list <shudders-in-horror>. Shoot, it
was so convenient to have Ter take the heat all the time....
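For readers wondering what the in-band signalling complaint is about: presumably it is the EOF-as-character issue from the start of the thread, where -1 typecast to a 16 bit character becomes 0xFFFF. A minimal sketch of the collision and of an out-of-band alternative. The names are made up, and this is not how the ANTLR runtime is actually structured.

```cpp
#include <cstdint>

// In-band: EOF (-1) travels in the same 16 bit channel as the characters,
// so after truncation it has the same bits as the code unit 0xFFFF.
inline bool is_eof_in_band(std::uint16_t ch) {
    return ch == static_cast<std::uint16_t>(-1);
}

// Out-of-band: the end-of-input flag is carried next to the character,
// so every character value stays usable as data.
struct Input {
    bool at_end;
    std::uint32_t ch;   // wide enough for U+10FFFF
};

inline bool is_eof_out_of_band(const Input& in) {
    return in.at_end;
}
```

With the in-band version a genuine 0xFFFF in the input is indistinguishable from end of file; with the out-of-band version it is just another value.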
On Wed, Apr 21, 2004 at 05:36:23PM -0700, Terence Parr wrote:
> UNICODE will work well.
If you're lucky.
> Note that 2.7.3 should do pretty well at
> UNICODE. Give it a shot :) \uFFFE is the max valid unicode right? -1
> shouldn't be a problem anymore.
Because:
On Wed, Apr 21, 2004 at 08:30:56PM -0700, Mark Lentczner wrote:
> No, it is not. U+10FFFF is (Since Unicode 3.1). Yup, 21 bits.
And I also wonder what you'll get if you feed the lexer in Java mode a
sequence that contains such a value broken up over two UTF-16 values (which
in lexer terms should be treated as one!). So I don't think even Java mode
is safe from headaches. The lexer has to be Unicode aware (in a sense this
might be nice, since there's then a chance of sharing more code between the
different target languages ;) )
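To illustrate the two UTF-16 values point: a code point above U+FFFF is stored as a surrogate pair, and a Unicode aware lexer would have to recombine the pair before matching. A small sketch of that arithmetic (function names are mine):

```cpp
#include <cassert>
#include <cstdint>

inline bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into the single code point it represents.
// Precondition: hi is a high surrogate and lo is a low surrogate.
inline std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo) {
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}
```

A lexer that looks at one 16 bit value at a time sees two separate "characters" here, neither of which is valid on its own.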
On Thu, Apr 22, 2004 at 02:46:41AM +0200, Sebastian Kaliszewski wrote:
> On Thu, 22 April 2004 02:36, Terence Parr wrote:
> > UNICODE will work well. Note that 2.7.3 should do pretty well at
> > UNICODE. Give it a shot :) \uFFFE is the max valid unicode right? -1
> > shouldn't be a problem anymore.
>
> Also in C++ mode?
If you're not feeding it complete Unicode input ;) and restrict it to plain
ASCII kind of stuff you should have no problem ;) In general: one big *no*.
On Wed, Apr 21, 2004 at 08:30:56PM -0700, Mark Lentczner wrote:
> Note that Java is broken in this regard. See
> http://weblogs.java.net/pub/wlg/1202 for a discussion. I understand
> that some XML tools in Java go to great lengths to get around the
> problem.
>
> > Oh. I think C++ doesn't handle UNICODE yet, but I'll let Ric answer
> > this ;)
> And indeed, I'm generating C++...
Which means you're out of luck, sadly.
Options:
- Hack around things and create a maintenance nightmare (most probable).
- Find (or handcode) a different lexer, but you'll probably have to tinker a
  bit with the support lib.
- Don't use ANTLR.
Cheers,
Ric
--
-----+++++*****************************************************+++++++++-------
---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893722 ----
-----+++++*****************************************************+++++++++-------
Chaos often breeds life, when order breeds habit.
--- Henry B. Adams, The Education of Henry Adams