[antlr-interest] Re: unicode support

Tue Dec 17 18:17:38 PST 2002

> <open.zone at virgin.net> wrote:
>>>stick at version 3.0 which is the last 16 bit version.  Current
>>>Unicode uses 21 bits but Java does not grok it.

They can be converted to surrogate pairs, I guess.
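
A minimal sketch of that conversion, assuming the standard UTF-16 encoding rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). The class and method names are hypothetical; newer JDKs expose the same thing as Character.toChars, which is not available in the Java releases current at the time of writing.

```java
// Sketch only: split a supplementary code point into a UTF-16 surrogate pair by hand.
final class SurrogateSketch {
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }
}
```

A lexer that only sees 16-bit chars would then match the two halves as consecutive characters.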

Terence Parr wrote:
> Yep, that table would be easy to convert into char defs.  It would take 
> time and space to load that hashtable each time I start up antlr though 
> just in case somebody referenced
> 
> 'ORIYA LETTER DDHA'
> 
> but it would be pretty cool.  Could do it on demand if they reference 
> any 'blah' that is not a single char.  I'd load the table and look it 
> up.  Any ideas making this faster?  It's 12000 hashtable entries ;)  
> Technically the charVocabulary could limit what I had to load, but...

Is charVocabulary useful? My understanding is that it is purely an 
optimization technique. But, I thought that ANTLR has a new way of doing 
token set optimizations now?

Anyway, I think that providing support for 'ORIYA LETTER DDHA' is a bit 
over the top for now. My understanding is that many people would be 
satisfied if Unicode general categories, Unicode blocks, and maybe 
Unicode directionality categories were supported. Then, I believe that 
you wouldn't have to worry so much about these optimizations.

With the technique described below, you could have one grammar file for 
each Unicode block (or really, whatever categorization you want). For 
example, the 'OriyaLexer' grammar would contain all 'Oriya ***' rules 
and 'ThaiLexer' would map all 'Thai ****' letters to rules.

> I think they ran 1..0xFFFF into isLetter(...) and got the ranges.  If I 
> defined this one alone or isLetterOrDigit it would be a big improvement 
> ;)

I very much agree.
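
Something like the following would produce those ranges. It is only a sketch: the class and method names are hypothetical, it starts the scan at 0 rather than 1, and the result depends on the Unicode tables shipped with whichever JVM runs it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: collapse every 16-bit char accepted by Character.isLetter
// into a list of inclusive [start, end] ranges.
final class LetterRanges {
    static List<int[]> letterRanges() {
        List<int[]> ranges = new ArrayList<>();
        int start = -1;                       // -1 means "not inside a range"
        for (int c = 0; c <= 0xFFFF; c++) {
            boolean letter = Character.isLetter((char) c);
            if (letter && start < 0) {
                start = c;                    // a new range opens here
            } else if (!letter && start >= 0) {
                ranges.add(new int[] { start, c - 1 });
                start = -1;                   // the range closed at c - 1
            }
        }
        if (start >= 0) {
            ranges.add(new int[] { start, 0xFFFF });
        }
        return ranges;
    }
}
```

The resulting ranges could be written out as the body of a protected lexer rule, or swapped for isLetterOrDigit in the same loop.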

> 2. If I have the stomach before release of 2.7.2, I try to add *simple* 
> UNICODE support.  Here is how sets could be done.  If you reference a 
> word like SPACE_SEPARATOR and there is no rule with that name in the 
> lexer, it will assume it is a Unicode set and try to load it 
> (intersected with charVocabulary???).  I'd prefer a syntactic indicator 
> that you meant a unicode set rather than a rule, but I can't think of 
> anything obvious at the moment.  Anyway, "load" means lookahead Java 

I like the thought of Unicode categories, etc. as rules. IMO, ANTLR 
should treat the Unicode stuff just like another lexical grammar. Then 
you can just import the "UnicodeGeneralCategories" grammar into your own 
grammar when you need it.

I guess the problem is that ANTLR does not provide a way for grammar A 
to reuse multiple grammars at once. For example, I would like to say:

class MyLexer imports UnicodeGeneralCategories, XMLCategories;

And then have MyLexer be able to use rules from UnicodeGeneralCategories 
and XMLCategories, just as though it extended both of them at once.

Then, you could just generate a single UnicodeGeneralCategories.g that 
defined a bunch of protected rules, one for each category.

I think this more general mechanism (replacing or augmenting grammar 
inheritance with grammar importation) would also have other benefits and 
would be useful both in parsers and lexers. It would basically mean that 
we would be creating reusable grammar libraries. Currently you can 
somewhat do that with grammar inheritance but you are limited to reusing 
one grammar at a time.

> class antlr.unicode.SPACE_SEPARATOR which is a BitSet subclass.  If 
> found, make an instance and just use it.  Nothing else in ANTLR will 
> care or notice.  All code generators will still work (at least for 
> generating char sets in the output lexers).  Pretty cool, eh?  This 
> way, I can provide a few simple sets like LETTER_DIGIT and you all are 
> free to define antlr.unicode.MY_FAVORITE_UNICODE_SET w/o having to 
> touch the antlr source code.  Just place that in your CLASSPATH.  I'll 
> provide the tool to pull stuff out of the unicode table for 
> convenience.  THis mechanism makes the classloader the "hashtable" to 
> map set name to bit set.  Clever, no?

Yes, but it would be best if I could define MY_FAVORITE_UNICODE_SET 
using an ANTLR grammar specification. Otherwise, you will have to define 
MY_FAVORITE_UNICODE_SET using Java, even if you are using the C# or C++ 
generators? It seems odd.
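
For reference, the class-loader lookup proposed above might look roughly like this on the Java side. This is a sketch under assumptions: UnicodeSetLoader is a hypothetical name, and the set class is placed in the default package to keep the example self-contained, where the proposal would put it at antlr.unicode.SPACE_SEPARATOR.

```java
import java.util.BitSet;

// Hypothetical set class; the proposal would name it antlr.unicode.SPACE_SEPARATOR.
class SPACE_SEPARATOR extends BitSet {
    public SPACE_SEPARATOR() {
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.getType((char) c) == Character.SPACE_SEPARATOR) {
                set(c);
            }
        }
    }
}

// Hypothetical helper: the class loader plays the role of the name-to-set hashtable.
final class UnicodeSetLoader {
    // Returns the named set, or null if no suitable class is on the CLASSPATH.
    static BitSet load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (BitSet) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            return null;   // unknown name, or not a BitSet subclass
        }
    }
}
```

A null result would be the point at which the tool reports an undefined rule or set name.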

> The last question I have is "how do I know the complete set for a 
> UnicodeSet like Mongolian"?  I can use Character.getType(...) to find 
> 
> Since LATIN is way lower than 1E00...i've no idea what a "RING BELOW" 
> is either ;)  How can I find the complete set of chars for a language?  
> Ah ha!
> 
> http://www.unicode.org/Public/UNIDATA/Blocks.txt
>
> So, this list is so small that ranges for a single language can be put 
> into a quicky hashtable.  you could say
> 
> options {
> 	charVocabulary = THAI;
> }

Again, in most cases I have seen, I have had to set charVocabulary to 
cover all of Unicode. Lexers simply don't do what you expect them to 
if they are given a character outside the charVocabulary. Therefore, the 
only real way to make your lexer robust is to ensure that the input does 
not ever contain characters outside the charVocabulary. And the easiest 
and most robust way to do that is to set charVocabulary as large as 
possible (i.e. all characters).

> 
> However, how do you limit digits to, say, THAI letters?
> 
> ID : (THAI)+ ; // includes too much!
> 
> Hmm...might need to introduce an intersection operator that let you say:
> 
> ID : (THAI & LETTER)* (THAI & (LETTER|DIGIT))+ ;

This seems like a very useful thing to have, not just for unicode 
support but for lexer (and parser?) rules in general. I think SableCC 
also has subtraction.
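
On bit sets, both operators are cheap to express. The sketch below assumes the sets are built over the 16-bit range with the JDK's own tables (Character.UnicodeBlock, Character.isLetter, Character.isDigit); CharSetOps and its methods are hypothetical names, not part of ANTLR.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Sketch only: character sets as BitSets, with intersection and subtraction.
final class CharSetOps {
    // Build the set of all 16-bit chars accepted by the predicate.
    static BitSet where(IntPredicate p) {
        BitSet s = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (p.test(c)) s.set(c);
        }
        return s;
    }

    // a & b, leaving both operands untouched.
    static BitSet and(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    // a - b, leaving both operands untouched.
    static BitSet minus(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }

    static final BitSet THAI =
        where(c -> Character.UnicodeBlock.of(c) == Character.UnicodeBlock.THAI);
    static final BitSet LETTER = where(Character::isLetter);
    static final BitSet DIGIT = where(Character::isDigit);
}
```

With these, (THAI & LETTER) is CharSetOps.and(THAI, LETTER), and a SableCC-style subtraction such as THAI minus LETTER is CharSetOps.minus(THAI, LETTER).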

> Uh...getting complicated.  How about I provide the tool to let you 
> predefine THAI_LETTER and THAI_DIGIT given the THAI range from this 
> table and the Character.isDigit method etc... from Java??  That will 
> work to start ;)

I like the intersection operator better. Anyway, I think that in the 
short term there is not much demand for this "Thai letter" scenario, and 
there are workarounds (especially if the Unicode block/category support 
is added) that could be used until the intersection operator is implemented.

- Brian
