[antlr-interest] unicode support
John Lambert
jlambert at nwlink.com
Tue Dec 17 13:31:06 PST 2002
Oops! I meant UCS4, not UCS2 (16-bit).
John
-----Original Message-----
From: John Lambert [mailto:jlambert at nwlink.com]
Sent: Tuesday, December 17, 2002 12:50 PM
To: antlr-interest at yahoogroups.com
Subject: RE: [antlr-interest] unicode support
I would also recommend the use of ICU; I think it is now the de facto
standard package for both C++ and Java.
It is also quite probable that anyone producing a Unicode application
would already be using the ICU package.
Please allow the full Unicode 3.2 specification range; it can be
represented in any of 3 formats: UCS2, UTF16 and UTF8.
UTF8 is probably the best format for input/output, but internally you may
wish to convert to UCS2, in the C++ code at least.
There is also the format '\U00000000' to '\U001FFFFF' for the non-BMP range.
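A minimal Java sketch of the internal 32-bit (UCS4) view and the eight-digit escape format mentioned above. The class and method names are made up for illustration, and the code relies on the code point methods of String and Character, which were added to Java after this thread was written. Note that Unicode code points currently end at U+10FFFF, so real input never reaches the upper end of the range quoted above.

```java
class CodePointEscapes {
    /** Decodes a UTF-16 Java string into 32-bit code points (the UCS4 view). */
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // joins a surrogate pair into one value
            out[n++] = cp;
            i += Character.charCount(cp);   // advance by 1 or 2 UTF-16 units
        }
        return out;
    }

    /** Renders every code point as a backslash, an upper-case U and 8 hex digits. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int cp : toCodePoints(s)) {
            sb.append(String.format("\\U%08X", cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' followed by U+1D11E (a non-BMP character stored as a surrogate pair)
        System.out.println(escape("A\uD834\uDD1E"));
    }
}
```

The two-unit surrogate pair comes out as a single escape, which is the property a lexer working on 16-bit chars would otherwise have to handle itself.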
John Lambert
-----Original Message-----
From: Terence Parr [mailto:parrt at jguru.com]
Sent: Monday, December 16, 2002 2:51 PM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] unicode support
Folks,
At this point, I think ANTLR unicode support looks like this:
1. use a UTF-8 decoder or whatever with Java to get a Reader to give to
"new Lexer(reader)".
2. use '\u...' characters in your grammar to match unicode chars and a
unicode range like '\u0000'..'\uFFFE' in charVocabulary option.
3. ANTLR 2.7.2 generates bitsets in a better way so that classes don't
explode with static data (JDK 1.4 wouldn't load some classes for
example).
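Step 1 above can be sketched with the standard java.io classes. This is only an illustration: the helper names are invented, and the generated lexer is left out so the snippet runs without ANTLR on the classpath (the comment marks where a hypothetical MyLexer would receive the reader).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

class Utf8ReaderDemo {
    /** Wraps a byte stream in a Reader that decodes UTF-8 into Java chars. */
    static Reader utf8Reader(InputStream in) throws IOException {
        return new InputStreamReader(in, "UTF-8");
    }

    /** Decodes a UTF-8 byte array fully, standing in for what a lexer would consume. */
    static String decode(byte[] utf8) {
        try {
            Reader reader = utf8Reader(new ByteArrayInputStream(utf8));
            // With ANTLR available, the reader would instead be handed to the
            // generated lexer, e.g. new MyLexer(reader) (MyLexer is hypothetical).
            StringBuilder sb = new StringBuilder();
            for (int c = reader.read(); c != -1; c = reader.read()) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 bytes for GREEK SMALL LETTER ALPHA followed by BETA
        byte[] bytes = {(byte) 0xCE, (byte) 0xB1, (byte) 0xCE, (byte) 0xB2};
        System.out.println(decode(bytes).length());
    }
}
```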
A few things that would be interesting to add:
Allow you to reference sets like JAVA_IDENTIFIER or LATIN_... and then
characters like 'GREATER-THAN SIGN' and 'APOSTROPHE-QUOTE'. The latter
would be easy: just a hashtable lookup if I can find the unicode char
index in Java somewhere ;) The former is harder as there is nothing in
Java's Character.java class that lets me get a set of chars for say
GREEK_EXTENDED. Anybody know a good library that would give me a set
of chars from these char class names? I've just found:
http://oss.software.ibm.com/icu/userguide/unicodeSet.html
which might work. It seems to have a UnicodeSet, but the problem is
that ANTLR would then depend on this other library that we have no
control over. ;( Anybody have a solution? We need a mapping like:
GREEK_EXTENDED -> set of chars
and the character mapping like:
APOSTROPHE-QUOTE -> char
I can convert a table to Java with a shell script probably if we can
find a convenient table.
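One way to get both mappings without an external library is to derive them from the JDK itself by brute force. This is a sketch under assumptions, not something ANTLR does: the class and method names are invented, Character.UnicodeBlock.of is used to test block membership one char at a time, and Character.getName only exists in Java 7 and later. The names in the table are whatever the JDK's Unicode data uses, so a character such as the apostrophe is listed as APOSTROPHE (APOSTROPHE-QUOTE is its older Unicode 1.0 name).

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class UnicodeTables {
    /** Collects every BMP char that falls in the given Unicode block. */
    static BitSet charsInBlock(Character.UnicodeBlock block) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            // of() returns null for chars outside any defined block
            if (block.equals(Character.UnicodeBlock.of((char) c))) {
                set.set(c);
            }
        }
        return set;
    }

    /** Builds a name -> char table for the BMP (needs Java 7+ for getName). */
    static Map<String, Character> nameTable() {
        Map<String, Character> names = new HashMap<String, Character>();
        for (int c = 0; c <= 0xFFFF; c++) {
            String name = Character.getName(c);   // null for unassigned code points
            if (name != null) {
                names.put(name, Character.valueOf((char) c));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        BitSet greekExtended = charsInBlock(Character.UnicodeBlock.GREEK_EXTENDED);
        System.out.println(greekExtended.get(0x1F00));
        System.out.println(nameTable().get("GREATER-THAN SIGN"));
    }
}
```

Because both tables are computed once from the runtime's own data, they could also be dumped to static Java source ahead of time, which is close to the "convert a table to Java" idea above.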
These ideas would work for charVocabulary and for just referencing them
in lexer grammars.
If this is easy to do, I'll try to pop it into 2.7.2 before I release.
Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Lecturer in Comp. Sci., University of San Francisco
Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/