[antlr-interest] unicode strings using supplemental char range

Mon Jun 28 05:20:12 PDT 2004

On Sat, Jun 26, 2004 at 08:48:34PM -0700, Mark Lentczner wrote:
> [ Terminology: "char" = Java type, 16-bits, "character" = Unicode
> character in range 0..0x10FFFF, "String" = Java type, "unicode string"
> = a sequence of zero or more characters. ]
>
> 	- the ability to read a stream of characters
> 	- the ability to take a String written in a grammar file (i.e.
> "abc\u20ac123") and produce a unicode string from it (i.e. [ 97, 98,
> 99, 8364, 49, 50, 51 ] - or if your mailer can handle it: [ 'a', 'b',
> 'c', '€', '1', '2', '3' ])
>
> Supporting these requirements hardly needs something as heavy weight as
> ICU for either Java or C++ parsers.  (ICU has things like calendar
> handling, regex matching, and number formatting in it!)

It's a pity they don't have a 'light' version indeed. That's also why I'm
contemplating making some lightweight classes for some aspects we need
inside ANTLR for the C++ side.

> A simple class for the unicode string, an interface for the streaming
> protocol, a few implementations of the streaming interface, and a utility
> for de-escaping user written strings is all that is needed.
>
> Note: The escape syntax for Antlr will probably need to be redesigned.
> "\u" followed by four hex digits doesn't cut it, though could be kept
> for backward compatibility.  It is probably best to bite the bullet and
> have a delimited escape sequence: "\U" followed by hex digits followed
> by ";".  Or if you want to look like the Unicode documentation
> standards, "\U+"...

I guess I'd prefer the \U[0-9A-Fa-f]+; syntax, since it has a closing
character (I didn't notice one in the \U+ syntax, although I did not look
too closely for it).
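
To make the two escape forms concrete, here is a minimal sketch of a
de-escaping utility. The names unescape and hexValue are hypothetical, the
delimited "\U<hex digits>;" form is only the proposal from this thread (not
an existing ANTLR feature), and the sketch assumes that everything outside
an escape is plain ASCII.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
inline int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: expands "\uXXXX" (exactly four hex digits) and the
// proposed delimited form "\U<1..6 hex digits>;" into code points.
inline std::u32string unescape(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const bool isEscape = in[i] == '\\' && i + 1 < in.size() &&
                              (in[i + 1] == 'u' || in[i + 1] == 'U');
        if (!isEscape) {
            // Assumption: unescaped text is ASCII, one byte per character.
            out.push_back(static_cast<unsigned char>(in[i]));
            ++i;
            continue;
        }
        const bool delimited = (in[i + 1] == 'U');
        const std::size_t maxDigits = delimited ? 6 : 4;
        i += 2;  // skip the backslash and the 'u' or 'U'
        char32_t value = 0;
        std::size_t digits = 0;
        while (i < in.size() && digits < maxDigits && hexValue(in[i]) >= 0) {
            value = value * 16 + static_cast<char32_t>(hexValue(in[i]));
            ++i;
            ++digits;
        }
        if (delimited) {
            // The delimited form must end with ';'.
            if (digits == 0 || i >= in.size() || in[i] != ';')
                throw std::invalid_argument("malformed \\U escape");
            ++i;
        } else if (digits != 4) {
            throw std::invalid_argument("malformed \\u escape");
        }
        if (value > 0x10FFFF)
            throw std::invalid_argument("escape beyond U+10FFFF");
        out.push_back(value);
    }
    return out;
}
```

With this sketch, unescape("abc\\u20ac123") yields the seven characters
97, 98, 99, 8364, 49, 50, 51, and unescape("\\U1D11E;") yields the single
character 0x1D11E, which the four-digit form cannot express.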

> Other features that have been discussed fall into two camps:
>
> Features that are really not logically part of a lexer/parser package:
> 	- transcoding the input from a some encoding byte stream into a stream
> of characters
> 	- character sequence normalization
> None of these should be part of Antlr (IMHO) and are easily handled as
> needed via re-implementing the streaming interface.

I'm not sure, but don't we need this on the C++ side? We have no Java to
do this stuff for us. So at the least I'd like to be able to feed a UTF-8
string to ANTLR out of the box without deriving some custom classes. Life
is a lot easier on the lexer side if it only has to deal with Unicode
characters, so some input stream decoding will be necessary. On second
thought, this probably does not need transcoding or normalization, just
decoding of the UTF-xx forms.
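
As a rough illustration of how small the UTF-8 part is, here is a sketch of
a decoder that turns a UTF-8 byte string into code points. The function
name decodeUtf8 is hypothetical and this is not ANTLR code; a real input
stream class would decode incrementally rather than on a whole string.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a complete UTF-8 byte string into code points.
// Throws std::invalid_argument on malformed input.
inline std::u32string decodeUtf8(const std::string& bytes) {
    // Smallest code point that legitimately needs 0, 1, 2 or 3 trailing bytes.
    static const char32_t minForLength[] = {0x0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        for (int k = 0; k < extra; ++k) {
            if (i >= bytes.size())
                throw std::invalid_argument("truncated UTF-8 sequence");
            const unsigned char c = static_cast<unsigned char>(bytes[i++]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates and values past U+10FFFF.
        if (cp < minForLength[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        out.push_back(cp);
    }
    return out;
}
```

The lexer then only ever sees whole characters: the three bytes E2 82 AC
come out as the single character 0x20AC, and the four bytes F0 9D 84 9E as
0x1D11E.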

Question: How do we deal with the Unicode characters that do not fit in
16 bits? Would they need a more expensive struct to hold all the bits, or
do we have to make the lexers operate on UTF-32?
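
For scale: the largest Unicode character is 0x10FFFF, which needs 21 bits,
so a single 32-bit unit per character (UTF-32) always suffices, while
16-bit units need a surrogate pair for anything above 0xFFFF. The sketch
below shows that pairing; the names encodeUtf16 and combineSurrogates are
hypothetical and only illustrate the trade-off.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one code point as one or two UTF-16 units.
inline std::u16string encodeUtf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
        return out;
    }
    // Split the remaining 20 bits over a high and a low surrogate.
    cp -= 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    return out;
}

// Hypothetical helper: recombines a high/low surrogate pair into one
// code point. The caller must pass a valid pair.
inline char32_t combineSurrogates(char16_t high, char16_t low) {
    return 0x10000 +
           ((static_cast<char32_t>(high) - 0xD800) << 10) +
           (static_cast<char32_t>(low) - 0xDC00);
}
```

A lexer working on 16-bit units would have to do this recombination
wherever it matches a character above 0xFFFF; a lexer working on 32-bit
units never does, at the cost of wider buffers.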

Question: How do we deal with this codegen-wise? Can this choice be delayed
until codegen, or does it need to be resolved at analysis time?

> Features that might be possible nice utilities to have in a
> lexer/parser package:
> 	- case folding
> 	- Unicode character classes as pre-defined (or algorithmically
> defined) lexer rules
> 	- Unicode character blocks as pre-defined (or algorithmically defined)
> lexer rules
> These may be nice, though Antlr has gotten along just fine until now
> without them.  I would heavily caution implementing these, or basing
> implementation issues on them until someone speaks up who would
> actually use them.  And even then, I caution adding large library needs
> to Antlr just to support optional features.

It's probably an idea to have some include file mechanism with which we can
define and import these classes/blocks/encodings. That way we can start
with a bare minimum, and people can submit their additions later on for
inclusion in the distro. Or, with some luck, we can generate these include
files from data in ICU (I recall seeing some tables or something similar
in their distro).
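
One conceivable shape for such an importable definition on the C++ side is
a sorted table of code point ranges plus a lookup. This is only a sketch:
CodePointRange, kDigitRanges and inRanges are hypothetical names, and the
table is a deliberately tiny, hand-written subset rather than anything
generated from ICU.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical type: one inclusive range of code points.
struct CodePointRange {
    char32_t first;
    char32_t last;
};

// Illustrative subset only: decimal digits from three blocks.
// The ranges must be sorted and must not overlap.
static const CodePointRange kDigitRanges[] = {
    {0x0030, 0x0039},  // ASCII digits
    {0x0660, 0x0669},  // Arabic-Indic digits
    {0x0966, 0x096F},  // Devanagari digits
};

// Hypothetical helper: binary search for cp in a sorted range table.
inline bool inRanges(char32_t cp, const CodePointRange* ranges,
                     std::size_t count) {
    const CodePointRange* end = ranges + count;
    // Find the first range whose upper bound is not below cp.
    const CodePointRange* it = std::lower_bound(
        ranges, end, cp,
        [](const CodePointRange& r, char32_t value) { return r.last < value; });
    return it != end && it->first <= cp;
}
```

A table like this keeps the core small: a new class or block is just
another array, and a call such as inRanges(c, kDigitRanges, 3) is the only
code a lexer needs in order to use it.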

On a general note, it would be nice if ANTLR 3 supported some kind of
character range macro definition syntax. Doing things via protected rules
is sometimes a bit 'overkill' (e.g. it implies extra function calls where
the ranges could just have been included in a switch).

Cheers,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893755  ----
-----+++++*****************************************************+++++++++-------
 "Don't call me stupid." "Oh, right. To call you stupid would be an insult
    to stupid people. I've known sheep that could outwit you! I've worn
              dresses with higher IQs!" --- A Fish Called Wanda
