[antlr-interest] Unicode handling

Ric Klaren klaren at cs.utwente.nl
Fri Apr 23 02:35:37 PDT 2004


Hi,

On Thu, Apr 22, 2004 at 01:01:20PM -0700, Mark Lentczner wrote:
> > Well the C++ stuff truncates to 8 bits whenever it sees fit.
> My plan is to not do anything in 16 bits.  Just lex the UTF-8 as a pure
> 8-bit stream.  So, so long as the Antlr generator and the generated C++
> code and the support lib are all 8-bit character clean, I'm home free.

It's mostly clean. But the fact that the people who ported it used ints and
other signed types where they should have used unsigned ones sometimes comes
to the surface.
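
For illustration, here's a minimal sketch of the kind of sign-extension
surprise I mean (standalone code, not the actual support-library source):

    #include <iostream>

    int main()
    {
        char c = '\303';        // 0xC3, first byte of UTF-8 for U+00F7
        int  i = c;             // sign-extends to -61 where char is signed

        if (i == 0xC3)
            std::cout << "matched" << std::endl;
        else
            std::cout << "missed, got " << i << std::endl;   // prints -61

        unsigned char u = '\303';
        std::cout << "as unsigned: " << int(u) << std::endl;  // prints 195
        return 0;
    }

Anywhere a byte like that ends up in a plain char or int and then gets
compared against 0x80-0xFF (or against EOF), things quietly go wrong.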

> > Note that 2.7.4 will barf out attempts at 16 bit char constants.
> Do you mean that 2.7.4 won't allow '\u00BF' but will allow '\277'?  Or
> will 2.7.4 only be upset if the upper byte isn't zero, i.e. '\u2004'?
> (Yes, I'm still on 2.7.3)

I think it accepts constants like '\u00BF' but checks that they are in the
range 0-255. So the latter (at least that's the intention).

> > If you trick antlr to make the right bitsets,
> Er, is there a known issue with bitsets and 8-bit high characters?

No, I mean that you have to make sure the dummy rule you put in during
generation produces the right first/follow sets. But you probably have that
covered ;)

> > The moment you put the icky bits in nice 8-bit strings you're basically
> > home free, except for sorting out the actual lengths of the text etc.
> All the literals in our source language are 7-bit strings, so no
> concern here.  But even if we supported something like U+00F7 (the
> division sign, ÷, UTF-8 encoded as 0xC3 0xB7), then I'd just code the
> literal as "\u00C3\u00B7" or "\303\267", and let Antlr think it is a
> two-character string.

Sounds like it will work.
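
Just to make the byte-level view concrete, a tiny standalone check (plain
C++, nothing ANTLR-specific; the names are mine):

    #include <cstdio>
    #include <cstring>

    int main()
    {
        // Same bytes as the grammar literal "\u00C3\u00B7" / "\303\267":
        const char *division = "\303\267";

        std::printf("%u bytes: 0x%02X 0x%02X\n",
                    (unsigned)std::strlen(division),
                    (unsigned char)division[0],
                    (unsigned char)division[1]);
        // prints: 2 bytes: 0xC3 0xB7
        return 0;
    }

A UTF-8 encoded input file containing '÷' delivers exactly those two bytes,
so an 8-bit clean lexer sees them as a two-character string, which is all
you need.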

> > You could get away with redefining the strings in antlr to wchars and
> > recompiling a hacked version of the support lib to have a bit more
> > 'room' to maneuver. That has been done before with some luck.
> <speak voice="kid from Time Bandits"> No, don't touch it.... wchar is
> EEEEEEVIL! </speak>

What is particularly evil about it? Just curious ;) (Note that I personally
have never tinkered with the tiresome Unicode stuff beyond reading up a
little with antlr in mind, so real-world experiences are most welcome.)

> > I commend you if you do it with the current support lib (in both cases
> > ;) )
> Any code changes will be coming your way, Ric...

Looking forward to it :)

> > Might be preferable over reinventing the wheel though. And for me a lot
> > quicker to implement stuff (unless there are volunteers out there?).
> If (and it is a big if), Antlr wanted to support the idea of "specify a
> parser with a Unicode source  character set, but the generated parser
> reads and parses the UTF-8 encoded stream representation"  I believe
> that I can offer the code that would make this automatic.
>
> For example:
> 	options { charVocabulary = Unicode-via-UTF-8; }
> 	...
> 	ALPHA_OMEGA: "\u0391\u03A9" | "\u03B1\u03C9" ;
> 	DASHES: '\u2010'..'\u2015' ;
>
> Would internally become:
> 	options { charVocabulary = '\u0000'..'\u00FF'; }
> 	...
> 	ALPHA_OMEGA: "\316\221\316\251" | "\316\261\317\211" ;
> 	DASHES: '\342' '\200' '\220'..'\225' ;
>
> The only hitch is that the user would probably have to up the value of
> k manually (I don't think I could, or would want to, compute the "correct"
> new value).  I have a working algorithm that handles these and more
> complicated cases as well. (It handles the XML 1.0 and XML 1.1 name
> productions, which are pretty hairy!)

It sounds a bit like a 'hack' to support Unicode for C++ (at least that's
my current impression from reading this). It would definitely be interesting
to look at, but it might be better to build some more structural support
into the antlr lexer and antlr syntax.
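
To make the rewriting concrete, here is a rough sketch of the simple case in
the DASHES example, where the whole code-point range shares the same UTF-8
lead bytes so only the last byte varies. Mark's actual algorithm obviously
also has to split ranges that cross a lead-byte boundary; that part is not
shown, and the function names are mine:

    #include <cstdio>

    // Encode one code point as UTF-8, return the number of bytes.
    static int utf8(unsigned int cp, unsigned char out[4])
    {
        if (cp < 0x80)    { out[0] = cp; return 1; }
        if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                            out[1] = 0x80 | (cp & 0x3F); return 2; }
        if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                            out[1] = 0x80 | ((cp >> 6) & 0x3F);
                            out[2] = 0x80 | (cp & 0x3F); return 3; }
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }

    // Print the 8-bit rewrite of a code-point range, e.g. U+2010..U+2015
    // becomes '\342' '\200' '\220'..'\225'.
    void rewriteRange(unsigned int lo, unsigned int hi)
    {
        unsigned char a[4], b[4];
        int la = utf8(lo, a), lb = utf8(hi, b);

        bool same = (la == lb);
        for (int i = 0; same && i < la - 1; ++i)
            same = (a[i] == b[i]);

        if (!same) {
            std::printf("range crosses a lead-byte boundary, needs splitting\n");
            return;
        }
        for (int i = 0; i < la - 1; ++i)
            std::printf("'\\%03o' ", a[i]);
        std::printf("'\\%03o'..'\\%03o'\n", a[la - 1], b[la - 1]);
    }

    int main()
    {
        rewriteRange(0x2010, 0x2015);
        return 0;
    }

The multi-byte expansion is also why k has to go up: a decision that used to
be made on one character now needs to look a couple of bytes ahead.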

> > And I also wonder what you'll get if you feed the lexer in java mode a
> > sequence that contains such a value broken up over two UTF-16 values
> > (that
> > for lexer terms should be treated as one!).
> Java prior to 1.5 is blissfully unaware.  It will think of a UTF-16
> surrogate pair as two characters.  In 1.5 it will start treating the
> type 'char' as a UTF-16 code unit, not a Unicode character.  It's not
> clear how this will affect things, but I doubt they'll break any old APIs.
>
> What this means is that parsers currently built in Antlr really parse
> UTF-16 input, not Unicode.  So if you want to match U+1D11E (Musical
> Symbol G Clef), you have to match the string "\uD834\uDD1E".

Yup.
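
For reference, the surrogate split is purely mechanical; a standalone sketch
(plain C++, nothing ANTLR-specific):

    #include <cstdio>

    int main()
    {
        unsigned int cp = 0x1D11E;              // MUSICAL SYMBOL G CLEF
        unsigned int v  = cp - 0x10000;         // 20-bit value to split
        unsigned int hi = 0xD800 + (v >> 10);   // high (lead) surrogate
        unsigned int lo = 0xDC00 + (v & 0x3FF); // low (trail) surrogate

        std::printf("U+%X -> \\u%04X \\u%04X\n", cp, hi, lo);
        // prints: U+1D11E -> \uD834 \uDD1E -- which is why the literal has
        // to be written as the string "\uD834\uDD1E" in a grammar today.
        return 0;
    }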

Cheers,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893722  ----
-----+++++*****************************************************+++++++++-------
  "You can't expect to wield supreme executive power just because some
   watery tot throws a sword at you!"
  --- Monty Python and the Holy Grail



 