[antlr-interest] unicode strings using supplemental char range
Mark Lentczner
markl at glyphic.com
Sat Jun 26 20:48:34 PDT 2004
> ...That implies though that all this 1.5 UNICODE "fixing"
> won't be available to ANTLR itself nor the output parsers. That
> implies that none of the supplemental codes will work for isUpperCase
> or isDigit etc... I'm really at a loss to figure out how to proceed
> here. I really appreciate all the feedback from people; eventually
> we'll figure this out.
I think the problem is much simpler than all this gnashing of teeth is
making things out to be.
Ignoring for the moment Unicode-specific extensions to Antlr, Antlr
needs very little to be fully Unicode compliant:
[ Terminology: "char" = Java type, 16-bits, "character" = Unicode
character in range 0..0x10FFFF, "String" = Java type, "unicode string"
= a sequence of zero or more characters. ]
- the ability to read a stream of characters
- the ability to take a String written in a grammar file (e.g.
"abc\u20ac123") and produce a unicode string from it (e.g. [ 97, 98,
99, 8364, 49, 50, 51 ] - or if your mailer can handle it: [ 'a', 'b',
'c', '€', '1', '2', '3' ])
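As a rough illustration of the second requirement, here is a minimal Java sketch that turns a grammar-file literal into an array of Unicode characters. The class and method names (GrammarEscapes, toCodePoints) are hypothetical and not part of Antlr, and the sketch only understands the existing four-hex-digit escape:

```java
import java.util.Arrays;

// Hypothetical helper, not an Antlr class: de-escapes a grammar literal.
final class GrammarEscapes {
    // Returns one int per Unicode character (0..0x10FFFF) in the literal.
    // A backslash, a lowercase 'u' and four hex digits become one value;
    // malformed hex digits make Integer.parseInt throw NumberFormatException.
    static int[] toCodePoints(String literal) {
        int[] buf = new int[literal.length()]; // never more characters than chars
        int n = 0;
        int i = 0;
        while (i < literal.length()) {
            if (literal.startsWith("\\u", i) && i + 6 <= literal.length()) {
                buf[n++] = Integer.parseInt(literal.substring(i + 2, i + 6), 16);
                i += 6;
            } else {
                // codePointAt joins a surrogate pair into a single character
                int cp = literal.codePointAt(i);
                buf[n++] = cp;
                i += Character.charCount(cp);
            }
        }
        return Arrays.copyOf(buf, n);
    }
}
```

Called on the literal from the example above (backslash, u, 20ac written out in the grammar file), this yields 97, 98, 99, 8364, 49, 50, 51.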
Supporting these requirements hardly needs something as heavyweight as
ICU for either Java or C++ parsers. (ICU has things like calendar
handling, regex matching, and number formatting in it!) A simple class
for the unicode string, an interface for the streaming protocol, a few
implementations of the streaming interface, and a utility for
de-escaping user written strings is all that is needed.
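To give a sense of how small these pieces could be, here is a sketch in Java of a unicode string class, a streaming interface, and one implementation of it. All names (CodePointStream, UnicodeString, StringCodePointStream) are made up for illustration; nothing here is an existing Antlr API:

```java
// Hypothetical streaming protocol: one Unicode character per call.
interface CodePointStream {
    int EOF = -1;

    // Returns the next character in 0..0x10FFFF, or EOF when exhausted.
    int next();
}

// Hypothetical unicode string: an immutable sequence of characters,
// stored one int per character so indexing never lands inside a pair.
final class UnicodeString {
    private final int[] codePoints;

    UnicodeString(int[] codePoints) {
        this.codePoints = codePoints.clone();
    }

    static UnicodeString fromJavaString(String s) {
        return new UnicodeString(s.codePoints().toArray());
    }

    int length() {
        return codePoints.length;
    }

    int characterAt(int index) {
        return codePoints[index];
    }

    @Override
    public String toString() {
        return new String(codePoints, 0, codePoints.length);
    }
}

// One implementation of the stream: walks a Java String, combining
// surrogate pairs into single characters.
final class StringCodePointStream implements CodePointStream {
    private final String source;
    private int pos = 0;

    StringCodePointStream(String source) {
        this.source = source;
    }

    @Override
    public int next() {
        if (pos >= source.length()) {
            return EOF;
        }
        int cp = source.codePointAt(pos);
        pos += Character.charCount(cp);
        return cp;
    }
}
```

Other implementations (file, socket, normalizing wrapper) would implement the same single method, so the lexer never needs to know where the characters come from.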
Note: The escape syntax for Antlr will probably need to be redesigned.
"\u" followed by four hex digits doesn't cut it (it cannot reach
characters above 0xFFFF), though it could be kept for backward
compatibility. It is probably best to bite the bullet and
have a delimited escape sequence: "\U" followed by hex digits followed
by ";". Or if you want to look like the Unicode documentation
standards, "\U+"...
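A sketch of what decoding such a delimited escape might look like, in Java. The syntax handled here ("\U", hex digits, ";") is only the proposal above, not anything Antlr accepts today, and the class name DelimitedEscapes is hypothetical. The old four-digit form is kept alongside it for backward compatibility:

```java
import java.util.Arrays;

// Hypothetical decoder for the proposed delimited escape syntax.
final class DelimitedEscapes {
    // Decodes a literal into Unicode characters (0..0x10FFFF).
    // Throws IllegalArgumentException for an unterminated escape, bad hex
    // digits (NumberFormatException is a subclass), or an out-of-range value.
    static int[] decode(String literal) {
        int[] buf = new int[literal.length()];
        int n = 0;
        int i = 0;
        while (i < literal.length()) {
            if (literal.startsWith("\\U", i)) {
                // New form: hex digits of any length, terminated by ';'
                int end = literal.indexOf(';', i + 2);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated escape at " + i);
                }
                int cp = Integer.parseInt(literal.substring(i + 2, end), 16);
                if (!Character.isValidCodePoint(cp)) {
                    throw new IllegalArgumentException("out of range at " + i);
                }
                buf[n++] = cp;
                i = end + 1;
            } else if (literal.startsWith("\\u", i) && i + 6 <= literal.length()) {
                // Old form: exactly four hex digits, kept for compatibility
                buf[n++] = Integer.parseInt(literal.substring(i + 2, i + 6), 16);
                i += 6;
            } else {
                int cp = literal.codePointAt(i);
                buf[n++] = cp;
                i += Character.charCount(cp);
            }
        }
        return Arrays.copyOf(buf, n);
    }
}
```

The terminator is what makes the escape unambiguous: without it, a digit following the escape in the grammar literal could not be told apart from one more hex digit.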
Other features that have been discussed fall into two camps:
Features that are really not logically part of a lexer/parser package:
- transcoding the input from a byte stream in some encoding into a
stream of characters
- character sequence normalization
Neither of these should be part of Antlr (IMHO); both are easily
handled as needed by re-implementing the streaming interface.
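For instance, transcoding could live entirely in a stream implementation that the application supplies. The Java sketch below leans on the standard InputStreamReader to do the decoding; the class name TranscodingSource and its next() method are assumptions for illustration, standing in for whatever streaming interface Antlr would define:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical character source: bytes in some encoding in, characters out.
final class TranscodingSource {
    private final Reader reader;

    TranscodingSource(InputStream bytes, Charset encoding) {
        // The JDK does the actual transcoding to 16-bit chars.
        this.reader = new InputStreamReader(bytes, encoding);
    }

    // Returns the next Unicode character (0..0x10FFFF), or -1 at end of input.
    int next() {
        try {
            int hi = reader.read();
            if (hi < 0) {
                return -1;
            }
            if (Character.isHighSurrogate((char) hi)) {
                // A supplemental character arrives as two chars; join them.
                int lo = reader.read();
                if (lo < 0 || !Character.isLowSurrogate((char) lo)) {
                    throw new IllegalStateException("unpaired surrogate in input");
                }
                return Character.toCodePoint((char) hi, (char) lo);
            }
            return hi;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A normalizing source could be written the same way, wrapping another source and buffering as needed, without the lexer being aware of it.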
Features that might be nice utilities to have in a
lexer/parser package:
- case folding
- Unicode character classes as pre-defined (or algorithmically
defined) lexer rules
- Unicode character blocks as pre-defined (or algorithmically defined)
lexer rules
These may be nice, though Antlr has gotten along just fine until now
without them. I would strongly caution against implementing these, or
basing implementation decisions on them, until someone speaks up who
would actually use them. And even then, I caution against adding large
library dependencies to Antlr just to support optional features.
[ Personally, I might like to see some of the Unicode character classes
(though, by no means all of them). The Unicode blocks are useless to
me, as is case folding. I suspect for real world language parsing
needs, unless a format is actually defined in terms of the Unicode
properties, a grammar writer might prefer to explicitly declare their
character sets in their lexer anyway. ]
- Mark
Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/