[antlr-interest] Re: unicode support

Tue Dec 17 16:52:28 PST 2002

On Tuesday, December 17, 2002, at 07:45  AM, micheal_jor 
<open.zone at virgin.net> wrote:

> --- In antlr-interest at yahoogroups.com, Pete Forman <pete.forman at w...>
> wrote:
>> At 2002-12-16 14:51 -0800, Terence Parr wrote:
>>> I can convert a table to Java with a shell script probably if we
> can
>>> find a convenient table.
>>
>> http://www.unicode.org/Public/UNIDATA/ReadMe.txt
>>
>> That is for the current version, i.e. Unicode 3.2.  You might wish
> to
>> stick at version 3.0 which is the last 16 bit version.  Current
>> Unicode uses 21 bits but Java does not grok it.
>
> What worked for me in the past:
>
> I imported the http://www.unicode.org/Public/3.1-Update/UnicodeData-
> 3.1.0.txt text file into a database and wrote simple queries to dump
> the list of char-values and char-ranges for each UnicodeCategory. I
> used MS SQL Server and MS Access as a prototyping-friendly front end
> to write all the queries/formatting code.

Yep, that table would be easy to convert into char defs.  It would take 
time and space to load that hashtable each time I start up antlr though 
just in case somebody referenced

'ORIYA LETTER DDHA'

but it would be pretty cool.  Could do it on demand if they reference 
any 'blah' that is not a single char.  I'd load the table and look it 
up.  Any ideas making this faster?  It's 12000 hashtable entries ;)  
Technically the charVocabulary could limit what I had to load, but...

Heh, what if I built the hashtable and then serialized it via Java?  It 
used to be horribly slow (and there are version issues), but would that 
load fast enough?  Just thinking of the 0-initialization of all that 
data makes me think it will be slow.

Is it ok if I use

http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt

instead of 3.1?

> In any case, this strategy should work with other RDBMSes as long as
> what you want is the char-values and char-ranges of the
> UnicodeCategory-ies. Otherwise Character.getType(char ch) should tell
> you what UnicodeCategory a given char belongs to.

That seems to be the strategy that people use.  For example, SableCC's 
Java grammar has this:

     unicode_letter =
         [0x0041..0x005a] | [0x0061..0x007a] | [0x00aa..0x00aa] | 
[0x00b5..0x00b5] |
         [0x00ba..0x00ba] | [0x00c0..0x00d6] | [0x00d8..0x00f6] | 
[0x00f8..0x01f5] |
	...

I think they ran 1..0xFFFF into isLetter(...) and got the ranges.  If I 
defined this one alone or isLetterOrDigit it would be a big improvement 
;)

Ok, so how is this for a plan.

1. Walk thru and update manual in prep for 2.7.2 (just cleaning up best 
I can for quick release).

2. If I have the stomach before release of 2.7.2, I try to add *simple* 
UNICODE support.  Here is how sets could be done.  If you reference a 
word like SPACE_SEPARATOR and there is no rule with that name in the 
lexer, it will assume it is a Unicode set and try to load it 
(intersected with charVocabulary???).  I'd prefer a syntactic indicator 
that you meant a unicode set rather than a rule, but I can't think of 
anything obvious at the moment.  Anyway, "load" means lookahead Java 
class antlr.unicode.SPACE_SEPARATOR which is a BitSet subclass.  If 
found, make an instance and just use it.  Nothing else in ANTLR will 
care or notice.  All code generators will still work (at least for 
generating char sets in the output lexers).  Pretty cool, eh?  This 
way, I can provide a few simple sets like LETTER_DIGIT and you all are 
free to define antlr.unicode.MY_FAVORITE_UNICODE_SET w/o having to 
touch the antlr source code.  Just place that in your CLASSPATH.  I'll 
provide the tool to pull stuff out of the unicode table for 
convenience.  THis mechanism makes the classloader the "hashtable" to 
map set name to bit set.  Clever, no?

The last question I have is "how do I know the complete set for a 
UnicodeSet like Mongolian"?  I can use Character.getType(...) to find 
out the *category*, but what about the vocabulary sets like Mongolian?  
It doesn't seem as simple as pulling out the first word from the name 
in the file:

18A8;MONGOLIAN LETTER MANCHU ALI GALI BHA;Lo;0;L;;;;;N;;;;;
18A9;MONGOLIAN LETTER ALI GALI DAGALGA;Mn;228;NSM;;;;;N;;;;;
1E00;LATIN CAPITAL LETTER A WITH RING BELOW;Lu;0;L;0041 
0325;;;;N;;;;1E01;

Since LATIN is way lower than 1E00...i've no idea what a "RING BELOW" 
is either ;)  How can I find the complete set of chars for a language?  
Ah ha!

http://www.unicode.org/Public/UNIDATA/Blocks.txt

has a list:

# Blocks-3.2.0.txt
# Correlated with Unicode 3.2
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplementary
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0780..07BF; Thaana
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
...

So, this list is so small that ranges for a single language can be put 
into a quicky hashtable.  you could say

options {
	charVocabulary = THAI;
}

However, how do you limit digits to, say, THAI letters?

ID : (THAI)+ ; // includes too much!

Hmm...might need to introduce an intersection operator that let you say:

ID : (THAI & LETTER)* (THAI & (LETTER|DIGIT))+ ;

Uh...getting complicated.  How about I provide the tool to let you 
predefine THAI_LETTER and THAI_DIGIT given the THAI range from this 
table and the Character.isDigit method etc... from Java??  That will 
work to start ;)

Please comment on the above ramblings if you think I'm on the wrong or 
right track. :)

If we get this right, we'll be the first free parser generator that 
gets this UNICODE "right" I'd say ;)

Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Lecturer in Comp. Sci., University of San Francisco

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/