[antlr-interest] Re: unicode support
Brian Smith
brian-l-smith at uiowa.edu
Tue Dec 17 18:17:38 PST 2002
> <open.zone at virgin.net> wrote:
>>>stick at version 3.0 which is the last 16 bit version. Current
>>>Unicode uses 21 bits but Java does not grok it.
They can be converted to surrogate pairs, I guess.
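A minimal sketch of that conversion, for illustration only. The class and method names are hypothetical; the arithmetic is the standard UTF-16 encoding of a code point above U+FFFF. On Java 5 and later, Character.toChars(int) performs the same conversion.

```java
// Hypothetical helper, not part of ANTLR or of any Java library.
final class SurrogateDemo {
    /** Splits a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair. */
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + codePoint);
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }
}
```

A lexer that only sees 16-bit chars would then have to match the two halves as a sequence.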
Terence Parr wrote:
> Yep, that table would be easy to convert into char defs. It would take
> time and space to load that hashtable each time I start up antlr though
> just in case somebody referenced
>
> 'ORIYA LETTER DDHA'
>
> but it would be pretty cool. Could do it on demand if they reference
> any 'blah' that is not a single char. I'd load the table and look it
> up. Any ideas making this faster? It's 12000 hashtable entries ;)
> Technically the charVocabulary could limit what I had to load, but...
Is charVocabulary useful? My understanding is that it is purely an
optimization technique. But, I thought that ANTLR has a new way of doing
token set optimizations now?
Anyway, I think that providing support for 'ORIYA LETTER DDHA' is a bit
over the top for now. My understanding is that many people would be
satisfied if Unicode general categories, Unicode blocks, and maybe
Unicode directionality categories were supported. Then, I believe that
you wouldn't have to worry so much about these optimizations.
With the technique described below, you could have one grammar file for
each Unicode block (or really, whatever categorization you want). For
example, the 'OriyaLexer' grammar would contain all 'Oriya ***' rules
and 'ThaiLexer' would map all 'Thai ****' letters to rules.
> I think they ran 1..0xFFFF into isLetter(...) and got the ranges. If I
> defined this one alone or isLetterOrDigit it would be a big improvement
> ;)
I very much agree.
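A rough sketch of how such ranges could be collected in Java, assuming a simple scan of the 16-bit range through Character.isLetter. The class and method names are made up for illustration, and the exact ranges depend on the Unicode version of the JDK in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR.
final class LetterRanges {
    /** Returns inclusive {start, end} ranges of 16-bit chars for which Character.isLetter is true. */
    static List<int[]> letterRanges() {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run of letters"
        for (int c = 0; c <= 0xFFFF; c++) {
            boolean letter = Character.isLetter((char) c);
            if (letter && start < 0) {
                start = c;                            // a new run begins
            } else if (!letter && start >= 0) {
                ranges.add(new int[] { start, c - 1 }); // the run ended at c - 1
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] { start, 0xFFFF });  // close a run that reaches the end
        }
        return ranges;
    }
}
```

Each resulting pair could then be written out as a character range in a lexer rule.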
> 2. If I have the stomach before release of 2.7.2, I try to add *simple*
> UNICODE support. Here is how sets could be done. If you reference a
> word like SPACE_SEPARATOR and there is no rule with that name in the
> lexer, it will assume it is a Unicode set and try to load it
> (intersected with charVocabulary???). I'd prefer a syntactic indicator
> that you meant a unicode set rather than a rule, but I can't think of
> anything obvious at the moment. Anyway, "load" means lookahead Java
I like the thought of Unicode categories, etc. as rules. IMO, ANTLR
should treat the Unicode stuff just like another lexical grammar. Then
you can just import the "UnicodeGeneralCategories" grammar into your own
grammar when you need it.
I guess the problem is that ANTLR does not provide a way for grammar A
to reuse multiple grammars at once. For example, I would like to say:
class MyLexer imports UnicodeGeneralCategories, XMLCategories;
And then have MyLexer be able to use rules from UnicodeGeneralCategories
and XMLCategories, just as though it extended both of them at once.
Then, you could just generate a single UnicodeGeneralCategories.g that
defined a bunch of protected rules, one for each category.
I think this more general mechanism (replacing or augmenting grammar
inheritance with grammar importation) would also have other benefits and
would be useful both in parsers and lexers. It would basically mean that
we would be creating reusable grammar libraries. Currently you can
somewhat do that with grammar inheritance but you are limited to reusing
one grammar at a time.
> class antlr.unicode.SPACE_SEPARATOR which is a BitSet subclass. If
> found, make an instance and just use it. Nothing else in ANTLR will
> care or notice. All code generators will still work (at least for
> generating char sets in the output lexers). Pretty cool, eh? This
> way, I can provide a few simple sets like LETTER_DIGIT and you all are
> free to define antlr.unicode.MY_FAVORITE_UNICODE_SET w/o having to
> touch the antlr source code. Just place that in your CLASSPATH. I'll
> provide the tool to pull stuff out of the unicode table for
> convenience. THis mechanism makes the classloader the "hashtable" to
> map set name to bit set. Clever, no?
Yes, but it would be best if I could define MY_FAVORITE_UNICODE_SET
using an ANTLR grammar specification. Otherwise, wouldn't you have to
define MY_FAVORITE_UNICODE_SET in Java, even if you are using the C# or
C++ generators? That seems odd.
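For reference, the classloader lookup proposed in the quoted text might look roughly like the following. This is only a sketch under assumptions: it uses java.util.BitSet as a stand-in for whatever bit set class ANTLR would use, and SpaceSeparatorSet and UnicodeSetLoader are hypothetical names.

```java
import java.util.BitSet;

// Hypothetical set class: one bit per 16-bit char in the SPACE_SEPARATOR (Zs) category.
class SpaceSeparatorSet extends BitSet {
    SpaceSeparatorSet() {
        super(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.getType((char) c) == Character.SPACE_SEPARATOR) {
                set(c);
            }
        }
    }
}

// Hypothetical loader: the classloader acts as the name-to-set table.
final class UnicodeSetLoader {
    /** Returns an instance of the named BitSet subclass, or null if no such set class is found. */
    static BitSet load(String className) {
        try {
            Object o = Class.forName(className).getDeclaredConstructor().newInstance();
            return (o instanceof BitSet) ? (BitSet) o : null;
        } catch (ReflectiveOperationException e) {
            return null; // unknown name: the caller can treat it as an ordinary rule reference
        }
    }
}
```

Note that every set defined this way is a compiled Java class, whichever target language the generated lexer uses.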
> The last question I have is "how do I know the complete set for a
> UnicodeSet like Mongolian"? I can use Character.getType(...) to find
>
> Since LATIN is way lower than 1E00...i've no idea what a "RING BELOW"
> is either ;) How can I find the complete set of chars for a language?
> Ah ha!
>
> http://www.unicode.org/Public/UNIDATA/Blocks.txt
>
> So, this list is so small that ranges for a single language can be put
> into a quicky hashtable. you could say
>
> options {
> charVocabulary = THAI;
> }
Again, in most cases I have seen, I have to set the charVocabulary to
always be all of Unicode. Lexers simply don't do what you expect them to
if they are given a character outside the charVocabulary. Therefore, the
only real way to make your lexer robust is to ensure that the input does
not ever contain characters outside the charVocabulary. And the easiest
and most robust way to do that is to set charVocabulary as large as
possible (i.e. all characters).
>
> However, how do you limit digits to, say, THAI letters?
>
> ID : (THAI)+ ; // includes too much!
>
> Hmm...might need to introduce an intersection operator that let you say:
>
> ID : (THAI & LETTER)* (THAI & (LETTER|DIGIT))+ ;
This seems like a very useful thing to have, not just for unicode
support but for lexer (and parser?) rules in general. I think SableCC
also has subtraction.
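To make the set semantics concrete, here is a small Java sketch of intersection and subtraction over character sets, using java.util.BitSet and the JDK's own Unicode tables. All names in it are hypothetical, and it says nothing about how ANTLR would implement such operators.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical helper; none of these names come from ANTLR.
final class CharSetOps {
    /** Builds the set of 16-bit chars accepted by the given predicate. */
    static BitSet where(IntPredicate test) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (test.test(c)) set.set(c);
        }
        return set;
    }

    /** a & b, leaving both arguments unchanged. */
    static BitSet intersect(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    /** a minus b, leaving both arguments unchanged. */
    static BitSet subtract(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }

    static final BitSet THAI =
        where(c -> Character.UnicodeBlock.of((char) c) == Character.UnicodeBlock.THAI);
    static final BitSet LETTER = where(c -> Character.isLetter((char) c));
    static final BitSet DIGIT = where(c -> Character.isDigit((char) c));
}
```

With these, intersect(THAI, LETTER) keeps only the letters of the Thai block, and subtract(THAI, LETTER) keeps the block's digits, marks and symbols.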
> Uh...getting complicated. How about I provide the tool to let you
> predefine THAI_LETTER and THAI_DIGIT given the THAI range from this
> table and the Character.isDigit method etc... from Java?? That will
> work to start ;)
I like the intersection operator better. Anyway, I think that in the
short term there is not much demand for this "Thai letter" scenario, and
there are workarounds (especially if the Unicode block/category support
is added) that could be used until the intersection operator is implemented.
- Brian