[antlr-interest] unicode support

Tue Dec 17 13:31:06 PST 2002

Oops! I meant UCS4, not UCS2 (16-bit).

John

-----Original Message-----
From: John Lambert [mailto:jlambert at nwlink.com]
Sent: Tuesday, December 17, 2002 12:50 PM
To: antlr-interest at yahoogroups.com
Subject: RE: [antlr-interest] unicode support

I would also recommend ICU; I think it is now the de facto standard
package for both C++ and Java. Anyone producing a Unicode application
is also quite likely to be using the ICU package already.

Please allow the full Unicode 3.2 specification range. It can be
represented in any of three formats: UCS2, UTF16 and UTF8.

UTF8 is probably the best format for input/output, but internally you may
wish to convert to UCS2, in the C++ code at least.

There is also the format '\U00000000' to '\U001FFFFF' for the non-BMP range.
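
To make the point about the non-BMP range concrete, here is a minimal Java sketch of how many storage units a single code point needs in UTF-16 and UTF-8. It relies on the code point methods of java.lang.Character that arrived in Java 5, so it is an illustration with a later JDK and not something available when this thread was written. The class name CodePointSketch and its helper methods are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class CodePointSketch {
    // Number of 16-bit UTF-16 code units needed for one code point:
    // 1 inside the BMP, 2 (a surrogate pair) outside it.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Number of bytes the code point occupies when encoded as UTF-8.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int euro = 0x20AC;   // EURO SIGN, inside the BMP
        int gClef = 0x1D11E; // MUSICAL SYMBOL G CLEF, outside the BMP

        System.out.println(utf16Units(euro) + " " + utf8Bytes(euro));   // prints "1 3"
        System.out.println(utf16Units(gClef) + " " + utf8Bytes(gClef)); // prints "2 4"

        // The two UTF-16 code units of the G clef form a surrogate pair.
        char[] units = Character.toChars(gClef);
        System.out.println(Integer.toHexString(units[0]) + " "
                + Integer.toHexString(units[1])); // prints "d834 dd1e"
    }
}
```

A single 16-bit unit cannot hold a code point above U+FFFF, which is why a 16-bit internal form needs surrogate pairs or a wider type for the full range.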

John Lambert

-----Original Message-----
From: Terence Parr [mailto:parrt at jguru.com]
Sent: Monday, December 16, 2002 2:51 PM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] unicode support

Folks,

At this point, I think ANTLR unicode support looks like this:

1. use a UTF-8 decoder or whatever with Java to get a Reader to give to
"new Lexer(reader)".
2. use '\u...' characters in your grammar to match unicode chars and a
unicode range like '\u0000'..'\uFFFE' in charVocabulary option.
3. ANTLR 2.7.2 generates bitsets in a better way so that classes don't
explode with static data (JDK 1.4 wouldn't load some classes for
example).
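
Step 1 above can be sketched roughly as follows. This is only an illustration of wrapping a byte stream in a UTF-8 decoding Reader with standard JDK classes. The class Utf8ReaderSketch and its methods are hypothetical names, and the lexer is mentioned only in a comment, because the generated lexer class depends on your grammar.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8ReaderSketch {
    // Wrap a byte stream in a Reader that decodes UTF-8 into Java chars.
    static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Decode a UTF-8 byte array by draining such a Reader, only to make
    // the decoded characters visible in this example.
    static String decodeUtf8(byte[] bytes) {
        try (Reader reader = utf8Reader(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Greek small alpha and beta: two chars, four bytes in UTF-8.
        byte[] bytes = "\u03B1\u03B2".getBytes(StandardCharsets.UTF_8);

        // With ANTLR 2.x the Reader from utf8Reader(...) would be handed
        // to the generated lexer's constructor instead of being drained.
        String decoded = decodeUtf8(bytes);
        System.out.println(decoded.length()); // prints 2
    }
}
```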

A few things that would be interesting to add:

Allow you to reference sets like JAVA_IDENTIFIER or LATIN_... and then
characters like 'GREATER-THAN SIGN' and 'APOSTROPHE-QUOTE'.  The latter
would be easy: just a hashtable lookup if I can find the unicode char
index in Java somewhere ;)  The former is harder as there is nothing in
Java's Character.java class that lets me get a set of chars for say
GREEK_EXTENDED.  Anybody know a good library that would give me a set
of chars from these char class names?  I've just found:

http://oss.software.ibm.com/icu/userguide/unicodeSet.html

which might work.  It seems to have a UnicodeSet, but the problem is
that ANTLR would then depend on this other library that we have no
control over. ;(  Anybody have a solution?  We need a mapping like:

GREEK_EXTENDED -> set of chars

and the character mapping like:

APOSTROPHE-QUOTE -> char

I can probably convert a table to Java with a shell script if we can
find a convenient table.

These ideas would work for charVocabulary and for just referencing them
in lexer grammars.
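
For readers on a current JDK, the two mappings described above can be approximated without an external library. This is a sketch only, not how ANTLR does or did it. It assumes Java 9 or later (Character.codePointOf) and the block lookup added in Java 5 (Character.UnicodeBlock.of(int) and forName), none of which existed when this was written. The class UnicodeNameSketch and its methods are hypothetical names.

```java
import java.util.BitSet;

final class UnicodeNameSketch {
    // Block name -> set of chars: collect every code point that the JDK
    // places in the given Unicode block.
    static BitSet charsInBlock(Character.UnicodeBlock block) {
        BitSet set = new BitSet();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (block.equals(Character.UnicodeBlock.of(cp))) {
                set.set(cp);
            }
        }
        return set;
    }

    // Character name -> char: resolve a Unicode character name to its
    // code point. Throws IllegalArgumentException for unknown names.
    static int charByName(String name) {
        return Character.codePointOf(name);
    }

    public static void main(String[] args) {
        // forName accepts the constant's identifier as text.
        Character.UnicodeBlock block = Character.UnicodeBlock.forName("GREEK_EXTENDED");
        BitSet greekExtended = charsInBlock(block);
        System.out.println(Integer.toHexString(greekExtended.nextSetBit(0))); // prints 1f00

        // The JDK lookup uses current Unicode names, so the apostrophe
        // is looked up as APOSTROPHE here.
        System.out.println((char) charByName("GREATER-THAN SIGN")); // prints >
        System.out.println((char) charByName("APOSTROPHE"));        // prints '
    }
}
```

A BitSet is used here only because it is a convenient set-of-integers type in the JDK. Scanning every code point is a brute-force choice that keeps the sketch short.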

If this is easy to do, I'll try to pop it into 2.7.2 before I release.

Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Lecturer in Comp. Sci., University of San Francisco
