[antlr-interest] Re: Problems with Unicode support in ANTLR

Thu May 16 18:53:39 PDT 2002

> Erm....Terrence are you there?  ;-)

Who me? ;)  I've been waiting to see what direction people would point 
me. ;)  I've just looked at the source for Character.getType() and all 
those wacky mysterious tables at the bottom of the Character.java source.

Recall that ANTLR generates bitsets on its own.  If you say (' '|'\t') 
or 'a'..'z' you'll see that ANTLR tests whether LA(1) is a member of 
"some bitset defined in bottom of lexer file".  What we need is for 
ANTLR to be aware of the standard categories, i.e. general categories 
like LOWERCASE_LETTER and Character.UnicodeBlock blocks like 
BASIC_LATIN.  The only problem is that we'll have to convert 
LOWERCASE_LETTER to a straight bitset that maps char -> yes/no if it's 
in that set, so we could test LA(1), the next char of lookahead, 
against it.  Also, don't forget that ANTLR needs to do the grammar 
analysis so that it can determine if you have a nondeterminism.  For 
example, I presume the following would be ambiguous/nondeterministic:

DUH : LOWERCASE_LETTER | BASIC_LATIN | BENGALI ;
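
To make the idea concrete, here is a rough sketch (not what ANTLR generates today) of turning a category into a flat bitset, testing a lookahead char against it, and spotting the overlap that would make DUH ambiguous.  The class and method names are made up for illustration, and the block ranges are hard-coded assumptions rather than taken from Character.UnicodeBlock:

```java
import java.util.BitSet;

public class CategoryBitSets {
    // One bit per 16-bit char; set when the char's general category matches.
    static BitSet forCategory(int category) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.getType((char) c) == category) {
                set.set(c);
            }
        }
        return set;
    }

    // Bitset for an inclusive char range, e.g. a Unicode block
    // (ranges are supplied by the caller; hypothetical helper).
    static BitSet forRange(int lo, int hi) {
        BitSet set = new BitSet(0x10000);
        set.set(lo, hi + 1);
        return set;
    }

    public static void main(String[] args) {
        BitSet lower = forCategory(Character.LOWERCASE_LETTER);
        BitSet basicLatin = forRange(0x0000, 0x007F);

        char la1 = 'q';  // stand-in for LA(1)
        System.out.println("LA(1) in LOWERCASE_LETTER: " + lower.get(la1));

        // Two alternatives that share any char cannot be told apart
        // with one char of lookahead.
        System.out.println("overlaps BASIC_LATIN: "
                + lower.intersects(basicLatin));
    }
}
```

Since 'a'..'z' sit in both LOWERCASE_LETTER and BASIC_LATIN, the intersection is non-empty, which is the kind of overlap the grammar analysis would have to report.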

I believe I could predefine all these character categories and then 
simply let you refer to them.  Note, though, that every set you 
reference would be a potentially very big uncompressed bitset.  Worst 
case, if a set contains a char as high as \uFFFE, you'd have about 65 
kilobits in the set, or about 8k bytes per bitset.  Every time I have 
to combine this set with a character or another set, that's another 8k 
worst case.  I could probably do a simple compression that ignores 
long runs of zeros at the front of the bitset.
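
As a sketch of what that compression might look like (the OffsetBitSet class is invented for this message, not part of ANTLR), one could remember the index of the first set bit and store only the bits from there on:

```java
import java.util.BitSet;

public class OffsetBitSet {
    private final int offset;   // first char whose bit is actually stored
    private final BitSet bits;  // bit i stands for char (offset + i)

    OffsetBitSet(BitSet full) {
        int first = full.nextSetBit(0);
        if (first < 0) {            // empty set: nothing to store
            offset = 0;
            bits = new BitSet();
        } else {
            offset = first;
            // Copy only the tail that starts at the first set bit.
            bits = full.get(first, full.length());
        }
    }

    boolean member(int c) {
        return c >= offset && bits.get(c - offset);
    }

    int storedBits() {
        return bits.length();
    }

    public static void main(String[] args) {
        BitSet full = new BitSet();
        full.set(0xFFFE);
        // Uncompressed: highest set bit + 1 = 65535 bits, roughly 8k bytes.
        System.out.println("uncompressed bits: " + full.length());

        OffsetBitSet packed = new OffsetBitSet(full);
        System.out.println("stored bits: " + packed.storedBits());
        System.out.println("member(0xFFFE): " + packed.member(0xFFFE));
    }
}
```

This only helps sets that start high in the range; a set with both 'a' and \uFFFE in it would still cost nearly the full 8k, so a run-length or range-list scheme might be needed for those.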

Is this the kind of thing you would need?  I.e., to be able to 
specifically refer to predefined Java character blocks as predefined 
ANTLR "characters"?

Ter

On Thursday, May 16, 2002, at 06:29  PM, micheal_jor wrote:

>
>> Okay, I see what you are talking about. Java's Character class does
>> have support for some categories; see
>> http://java.sun.com/j2se/1.4/docs/api/java/lang/Character.html
>>
>> Please look at the listed categories and let me know if it is too
>> limited. In particular, java.lang.Character.getType(), and the
>> static final category constants.
>
> I saw the static constants but couldn't see that they were used
> anywhere. Not surprisingly; I can't believe someone actually
> thought "getType()" makes sense as the accessor for a character's
> Unicode General Category -- what happened to getCategory() or
> getGeneralCategory()? Sheez!
>
> In any case, you are right that the feature is supported.
>
>
>> I would rather not have my Unicode-parsing application depend on
>> IBM's library since I would have to distribute it. I think that the
>> java.lang.Character class's support is sufficient.
>
> For the feature we've discussed so far, yes it is. The license for
> IBM's package doesn't forbid extracting what we need into ANTLR if
> memory serves.
>
>> Presumably, the modified ANTLR would then generate code like this:
>>      int type = Character.getType(LA(1));
>>      switch (type) {
>>         case Character.END_PUNCTUATION:
>>              mRULE(true);
>>              theRetToken=_returnToken;
>>              break;
>>         ....
>>      }
>>
>
> Erm....Terrence are you there?  ;-)
>
> Micheal
>
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
