[antlr-interest] unicode support

Mon Dec 16 14:51:22 PST 2002

Folks,

At this point, I think ANTLR's Unicode support looks like this:

1. Use a UTF-8 decoder (or a decoder for whatever encoding your input 
uses) in Java to get a Reader to pass to "new Lexer(reader)".
2. Use '\u...' escapes in your grammar to match Unicode chars, and give 
a Unicode range like '\u0000'..'\uFFFE' in the charVocabulary option.
3. ANTLR 2.7.2 generates bitsets in a better way, so that classes don't 
explode with static data (JDK 1.4 wouldn't load some classes, for 
example).
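
A minimal sketch of step 1, assuming the input bytes are UTF-8. The 
class and method names (Utf8ReaderSketch, openUtf8Reader, readAll) are 
made up for illustration, and MyLexer in the comment is a hypothetical 
generated lexer. In a real program the Reader would go straight to the 
lexer constructor instead of being drained into a String. 
StandardCharsets needs a newer JDK than 1.4; on older JDKs, passing the 
charset name "UTF-8" to InputStreamReader should do the same job.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderSketch {
    /** Wraps a byte stream in a Reader that decodes UTF-8 into Java chars. */
    public static Reader openUtf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    /** Drains a Reader into a String; only here to make the decoding visible. */
    public static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bytes 0xCE 0xB1 are the UTF-8 encoding of U+03B1.
        byte[] utf8 = { (byte) 0xCE, (byte) 0xB1 };
        Reader reader = openUtf8Reader(new ByteArrayInputStream(utf8));
        // With ANTLR you would hand the reader to the generated lexer here,
        // e.g. new MyLexer(reader), where MyLexer is a hypothetical name.
        String decoded = readAll(reader);
        System.out.println("decoded " + decoded.length() + " char(s), first is U+"
                + Integer.toHexString(decoded.charAt(0)).toUpperCase());
    }
}
```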

A few things that would be interesting to add:

Allow you to reference sets like JAVA_IDENTIFIER or LATIN_... and then 
characters like 'GREATER-THAN SIGN' and 'APOSTROPHE-QUOTE'.  The latter 
would be easy: just a hashtable lookup if I can find the Unicode char 
index in Java somewhere ;)  The former is harder, as there is nothing in 
Java's Character.java class that lets me get a set of chars for, say, 
GREEK_EXTENDED.  Does anybody know a good library that would give me a 
set of chars from these char class names?  I've just found:

http://oss.software.ibm.com/icu/userguide/unicodeSet.html

which might work.  It seems to have a UnicodeSet, but the problem is 
that ANTLR would then depend on this other library that we have no 
control over. ;(  Anybody have a solution?  We need a mapping like:

GREEK_EXTENDED -> set of chars

and the character mapping like:

APOSTROPHE-QUOTE -> char

I can probably convert a table to Java with a shell script if we can 
find a convenient table.
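
As a rough sketch of both mappings using only the JDK (so no external 
dependency): Character.UnicodeBlock.of(char) classifies a single char, 
so a brute-force scan over 0..0xFFFF can build the set for a block, and 
Character.isJavaIdentifierStart(char) can do the same for a 
JAVA_IDENTIFIER-style set. The class name UnicodeNameSketch, its method 
names, and the two-entry name table are assumptions for illustration; a 
real table would be generated from a full list of Unicode character 
names.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class UnicodeNameSketch {
    /** Collects every 16-bit char that Java places in the given Unicode block. */
    public static BitSet charsInBlock(Character.UnicodeBlock block) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            // of() returns null for chars outside any block; equals(null) is false.
            if (block.equals(Character.UnicodeBlock.of((char) c))) {
                set.set(c);
            }
        }
        return set;
    }

    /** Collects every 16-bit char that may start a Java identifier. */
    public static BitSet javaIdentifierStartChars() {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.isJavaIdentifierStart((char) c)) {
                set.set(c);
            }
        }
        return set;
    }

    /** A tiny hand-written name table; a real one would be generated. */
    public static Map<String, Character> sampleCharNames() {
        Map<String, Character> names = new HashMap<>();
        names.put("GREATER-THAN SIGN", '>');   // U+003E
        names.put("APOSTROPHE", '\'');         // U+0027, current standard name
        names.put("APOSTROPHE-QUOTE", '\'');   // same char, name as used in this post
        return names;
    }

    public static void main(String[] args) {
        BitSet greek = charsInBlock(Character.UnicodeBlock.GREEK_EXTENDED);
        System.out.println("GREEK_EXTENDED chars: " + greek.cardinality());
        System.out.println("first: U+" + Integer.toHexString(greek.nextSetBit(0)).toUpperCase());
        System.out.println("GREATER-THAN SIGN -> " + sampleCharNames().get("GREATER-THAN SIGN"));
    }
}
```

The scan touches all 65536 chars once per set, which should be cheap 
enough to run at grammar-analysis time; the resulting BitSet could 
equally be dumped into a generated table ahead of time.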

These names would work both in the charVocabulary option and as direct 
references in lexer grammars.

If this is easy to do, I'll try to pop it into 2.7.2 before I release.

Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Lecturer in Comp. Sci., University of San Francisco
