[antlr-interest] unicode strings using supplemental char range

Sat Jun 26 20:48:34 PDT 2004

> ...That implies though that all this 1.5 UNICODE "fixing"
> won't be available to ANTLR itself nor the output parsers.  That
> implies that none of the supplemental codes will work for isUpperCase
> or isDigit etc...  I'm really at a loss to figure out how to proceed
> here.  I really appreciate all the feedback from people; eventually
> we'll figure this out.

I think the problem is much simpler than all this gnashing of teeth is 
making things out to be.

Ignoring for the moment Unicode-specific extensions to Antlr, Antlr 
needs very little to be fully Unicode compliant:

[ Terminology: "char" = Java type, 16-bits, "character" = Unicode 
character in range 0..0x10FFFF, "String" = Java type, "unicode string" 
= a sequence of zero or more characters. ]

	- the ability to read a stream of characters
	- the ability to take a String written in a grammar file (e.g. 
"abc\u20ac123") and produce a unicode string from it (e.g. [ 97, 98, 
99, 8364, 49, 50, 51 ] - or if your mailer can handle it: [ 'a', 'b', 
'c', '€', '1', '2', '3' ])
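
[ To illustrate the second requirement, here is a minimal Java sketch. 
It assumes a runtime that has the 1.5 code point methods on String and 
Character; the class and method names (CodePoints, toCodePoints) are 
made up for this example and are not part of Antlr. ]

```java
import java.util.Arrays;

final class CodePoints {
    // Convert a Java String (a sequence of 16-bit chars) into an array of
    // Unicode characters (code points in 0..0x10FFFF).
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);     // joins a surrogate pair into one value
            out[n++] = cp;
            i += Character.charCount(cp);  // 1 char for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [97, 98, 99, 8364, 49, 50, 51]
        System.out.println(Arrays.toString(toCodePoints("abc\u20ac123")));
        // U+1D11E has to be spelled as a surrogate pair in Java source;
        // prints [119070]
        System.out.println(Arrays.toString(toCodePoints("\uD834\uDD1E")));
    }
}
```

The second call shows the supplemental range: two chars in the String 
come out as a single character in the unicode string.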

Supporting these requirements hardly needs something as heavyweight as 
ICU for either Java or C++ parsers.  (ICU has things like calendar 
handling, regex matching, and number formatting in it!)  A simple class 
for the unicode string, an interface for the streaming protocol, a few 
implementations of the streaming interface, and a utility for 
de-escaping user written strings is all that is needed.
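
[ A rough sketch of what the streaming interface and one implementation 
might look like in Java. Everything named here (CodePointStream, 
ReaderCodePointStream, StreamDemo) is hypothetical and only meant to 
show how small the surface is; it also assumes a Java version that has 
java.io.UncheckedIOException. ]

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical streaming protocol: one Unicode character per call.
interface CodePointStream {
    int EOF = -1;
    int next();  // a code point in 0..0x10FFFF, or EOF
}

// One possible implementation: wraps a java.io.Reader and joins surrogate pairs.
final class ReaderCodePointStream implements CodePointStream {
    private static final int NONE = -2;
    private final Reader in;
    private int pending = NONE;  // a char read ahead but not yet delivered

    ReaderCodePointStream(Reader in) { this.in = in; }

    private int read() {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int next() {
        int c = (pending != NONE) ? pending : read();
        pending = NONE;
        if (c < 0) return EOF;
        if (Character.isHighSurrogate((char) c)) {
            int d = read();
            if (d >= 0 && Character.isLowSurrogate((char) d)) {
                return Character.toCodePoint((char) c, (char) d);
            }
            pending = d;  // not a pair: deliver the lone surrogate, keep d for later
        }
        return c;
    }
}

final class StreamDemo {
    // Render every character of the stream as U+XXXX, separated by spaces.
    static String dump(CodePointStream s) {
        StringBuilder sb = new StringBuilder();
        for (int cp = s.next(); cp != CodePointStream.EOF; cp = s.next()) {
            sb.append("U+").append(Integer.toHexString(cp).toUpperCase()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Reader r = new StringReader("a\uD834\uDD1Eb");
        System.out.println(dump(new ReaderCodePointStream(r)));
    }
}
```

A lexer written against such an interface never sees chars or 
surrogates, so other implementations (over an int array, a file, a 
socket) can be swapped in without touching it.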

Note: The escape syntax for Antlr will probably need to be redesigned.  
"\u" followed by four hex digits doesn't cut it, since four digits 
cannot reach past 0xFFFF, though it could be kept 
for backward compatibility.  It is probably best to bite the bullet and 
have a delimited escape sequence: "\U" followed by hex digits followed 
by ";".  Or if you want to look like the Unicode documentation 
standards, "\U+"...
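
[ A sketch of a de-escaping utility for the two forms above: the 
existing "\u" plus four hex digits, and the proposed "\U" hex digits 
";" form. The delimited syntax is only a proposal, and the Unescaper 
class is invented for illustration. Other backslash escapes are copied 
through unchanged here, on the assumption that a later pass handles 
them. ]

```java
final class Unescaper {
    // Replace Unicode escapes in a grammar string literal with the
    // characters they denote. Any other escape pair is copied as-is.
    static String unescape(String lit) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < lit.length()) {
            char c = lit.charAt(i);
            if (c != '\\' || i + 1 >= lit.length()) {
                out.append(c);
                i++;
                continue;
            }
            char kind = lit.charAt(i + 1);
            if (kind == 'u' && i + 6 <= lit.length()) {
                // legacy form: exactly four hex digits
                out.appendCodePoint(Integer.parseInt(lit.substring(i + 2, i + 6), 16));
                i += 6;
            } else if (kind == 'U') {
                // proposed form: any number of hex digits, closed by ';'
                int end = lit.indexOf(';', i + 2);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated escape at index " + i);
                }
                int cp = Integer.parseInt(lit.substring(i + 2, end), 16);
                if (!Character.isValidCodePoint(cp)) {
                    throw new IllegalArgumentException("not a code point: " + cp);
                }
                out.appendCodePoint(cp);
                i = end + 1;
            } else {
                out.append(c).append(kind);  // e.g. an escaped backslash or quote
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = unescape("abc\\u20ac123 \\U1D11E;");
        System.out.println(s.codePointCount(0, s.length()));
    }
}
```

The delimiter is what lets "\U1D11E;" and "\U41;" coexist without 
guessing how many digits belong to the escape.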

Other features that have been discussed fall into two camps:

Features that are really not logically part of a lexer/parser package:
	- transcoding the input from a byte stream in some encoding into a 
stream of characters
	- character sequence normalization
Neither of these should be part of Antlr (IMHO), and both are easily 
handled as needed via re-implementing the streaming interface.
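
[ To show that both jobs can live outside the parser, here is a sketch 
using only standard Java classes: InputStreamReader for transcoding 
and java.text.Normalizer for normalization. It assumes a Java version 
that ships both Normalizer and StandardCharsets; InputPrep is a 
made-up name. ]

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class InputPrep {
    // Transcode: decode bytes in the given encoding into Java chars.
    static String decode(byte[] bytes, Charset encoding) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), encoding)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            for (int n = r.read(buf); n != -1; n = r.read(buf)) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Normalize to NFC: 'e' followed by U+0301 becomes the single character U+00E9.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        byte[] euroUtf8 = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC };
        String text = normalize(decode(euroUtf8, StandardCharsets.UTF_8));
        System.out.println(Integer.toHexString(text.codePointAt(0)));
    }
}
```

The result of these two steps is what would be handed to the parser's 
input stream; the parser itself stays ignorant of encodings and 
normalization forms.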

Features that might possibly be nice utilities to have in a 
lexer/parser package:
	- case folding
	- Unicode character classes as pre-defined (or algorithmically 
defined) lexer rules
	- Unicode character blocks as pre-defined (or algorithmically defined) 
lexer rules
These may be nice, though Antlr has gotten along just fine until now 
without them.  I would strongly caution against implementing these, or 
basing implementation decisions on them, until someone speaks up who 
would actually use them.  And even then, I caution against adding large 
library needs to Antlr just to support optional features.

[ Personally, I might like to see some of the Unicode character classes 
(though, by no means all of them).  The Unicode blocks are useless to 
me, as is case folding.  I suspect for real world language parsing 
needs, unless a format is actually defined in terms of the Unicode 
properties, a grammar writer might prefer to explicitly declare their 
character sets in their lexer anyway. ]
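
[ For what it is worth, a sketch of how character-class tests could be 
done without any extra library, assuming the int (code point) 
overloads that Java 1.5 adds to java.lang.Character. UnicodeClasses 
and its methods are hypothetical names, not anything in Antlr. ]

```java
final class UnicodeClasses {
    // General category Lu (uppercase letter), for any code point.
    static boolean isUppercaseLetter(int cp) {
        return Character.getType(cp) == Character.UPPERCASE_LETTER;
    }

    // General category Nd (decimal digit), for any code point.
    static boolean isDecimalDigit(int cp) {
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        int deseretLongI = 0x10400;  // DESERET CAPITAL LETTER LONG I
        int mathZero = 0x1D7D8;      // MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO
        System.out.println(isUppercaseLetter(deseretLongI));
        System.out.println(isDecimalDigit(mathZero));
        // Simple case mapping also has an int overload.
        System.out.println(Integer.toHexString(Character.toLowerCase(deseretLongI)));
    }
}
```

Both sample characters lie above 0xFFFF, so the char-based overloads 
cannot classify them; the int overloads can.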

	- Mark

Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/
