[antlr-interest] unicode support
Terence Parr
parrt at jguru.com
Mon Dec 16 14:51:22 PST 2002
Folks,
At this point, I think ANTLR unicode support looks like this:
1. Use a UTF-8 decoder (or whichever decoder matches your input
encoding) in Java to get a Reader, and pass it to "new Lexer(reader)".
2. Use '\u...' escapes in your grammar to match unicode chars, and give
a unicode range like '\u0000'..'\uFFFE' in the charVocabulary option.
3. ANTLR 2.7.2 generates bitsets in a better way so that classes don't
explode with static data (JDK 1.4 wouldn't load some classes, for
example).
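
For step 1, a minimal sketch of getting a decoding Reader might look
like the following. The class and method names here are made up for
illustration, and MyLexer stands in for whatever lexer class your
grammar generates; only the java.io and java.nio calls are real.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class Utf8ReaderSketch {
    // Wrap a raw byte stream in a Reader that decodes UTF-8 into Java chars.
    public static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Drain a Reader into a String, only to show what a lexer would see.
    public static String readAll(Reader r) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes CE B1 are the UTF-8 encoding of U+03B1 (Greek small alpha).
        byte[] bytes = { 'x', (byte) 0xCE, (byte) 0xB1 };
        Reader reader = utf8Reader(new ByteArrayInputStream(bytes));
        // With a generated lexer you would hand the reader over instead:
        //   MyLexer lexer = new MyLexer(reader);   // MyLexer is hypothetical
        String decoded = readAll(reader);
        System.out.println("decoded " + decoded.length() + " chars");
    }
}
```

The point is only that the lexer sees 16-bit chars, not bytes; the
decoding happens entirely in the Reader.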
A few things that would be interesting to add:
Allow you to reference sets like JAVA_IDENTIFIER or LATIN_... and then
characters like 'GREATER-THAN SIGN' and 'APOSTROPHE-QUOTE'. The latter
would be easy: just a hashtable lookup if I can find the unicode char
index in Java somewhere ;) The former is harder, as there is nothing in
Java's Character.java class that lets me get a set of chars for, say,
GREEK_EXTENDED. Anybody know a good library that would give me a set
of chars from these char class names? I've just found:
http://oss.software.ibm.com/icu/userguide/unicodeSet.html
which might work. It seems to have a UnicodeSet, but the problem is
that ANTLR would then depend on this other library that we have no
control over. ;( Anybody have a solution? We need a mapping like:
GREEK_EXTENDED -> set of chars
and the character mapping like:
APOSTROPHE-QUOTE -> char
I can probably convert a table to Java with a shell script if we can
find a convenient one.
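
As a rough sketch of the "GREEK_EXTENDED -> set of chars" mapping
without an outside library: Character.UnicodeBlock.of() goes the other
way (char to block), so one brute-force possibility is to scan all
16-bit chars and keep the ones that land in the block. This assumes
Java's block table is good enough for our purposes, and the class and
method names below are made up. It does nothing for the character
names like APOSTROPHE-QUOTE.

```java
import java.util.BitSet;

class BlockSetSketch {
    // Collect every 16-bit char that Java assigns to the given Unicode block.
    public static BitSet charsIn(Character.UnicodeBlock block) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.UnicodeBlock.of((char) c) == block) {
                set.set(c);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet greekExt = charsIn(Character.UnicodeBlock.GREEK_EXTENDED);
        // Print the first and last char index found in the block, in hex.
        System.out.println(Integer.toHexString(greekExt.nextSetBit(0))
                + ".." + Integer.toHexString(greekExt.length() - 1));
    }
}
```

The scan only needs to run when ANTLR processes the grammar, so the
65536 lookups should not matter much.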
These names would work both in charVocabulary and when referenced
directly in lexer grammars.
If this is easy to do, I'll try to pop it into 2.7.2 before I release.
Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Lecturer in Comp. Sci., University of San Francisco