[antlr-interest] Re: Problems with Unicode support in ANTLR

Thu May 16 15:05:41 PDT 2002

micheal_jor wrote:
> --- In antlr-interest at y..., Brian Smith <brian-l-smith at u...> wrote:
> 
> No, Unicode blocks are a different concept from Unicode General 
> Categories. I don't think Java's standard libraries support Unicode 
> categories.

Okay, I see what you are talking about. Java's Character class does have 
support for some categories; see 
http://java.sun.com/j2se/1.4/docs/api/java/lang/Character.html

Please look at the listed categories and let me know if they are too 
limited. In particular, see java.lang.Character.getType() and the static 
final category constants.
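
For reference, here is a minimal sketch of what getType() reports for a few ASCII characters. The class name CategoryDemo and the helper categoryName() are made up for illustration; only Character.getType() and the category constants come from java.lang.Character.

```java
public class CategoryDemo {
    // Hypothetical helper: maps a handful of java.lang.Character category
    // constants to readable names. Anything not listed falls through to "OTHER".
    public static String categoryName(char c) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:     return "UPPERCASE_LETTER";
            case Character.LOWERCASE_LETTER:     return "LOWERCASE_LETTER";
            case Character.DECIMAL_DIGIT_NUMBER: return "DECIMAL_DIGIT_NUMBER";
            case Character.START_PUNCTUATION:    return "START_PUNCTUATION";
            case Character.END_PUNCTUATION:      return "END_PUNCTUATION";
            case Character.SPACE_SEPARATOR:      return "SPACE_SEPARATOR";
            default:                             return "OTHER";
        }
    }

    public static void main(String[] args) {
        // Print the category reported for each sample character.
        char[] samples = { 'A', 'a', '7', '(', ')', ' ' };
        for (char c : samples) {
            System.out.println("'" + c + "' -> " + categoryName(c));
        }
    }
}
```

Whether this set of categories is enough depends on the grammar; the sketch only shows how the lookup itself behaves.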

>>I was thinking of patching ANTLR's Java generator to be able to use 
>>named Unicode character categories as "pre-defined" "protected" 
>> lexer rules, but supporting anything more than the Character class 
>> handles is over my head.
> 
> That would be a useful addition - I mean the ability to define 
> such "preset" rules in ANTLR. I can do the work for Unicode 
> categories once the basic framework is in place.

> ter, is it OK to have ANTLR rely on additional libraries or would I 
> have to somehow port the required Unicode functionality into ANTLR 
> directly?

I would rather not have my Unicode-parsing application depend on IBM's 
library since I would have to distribute it. I think that the 
java.lang.Character class's support is sufficient.

Presumably, the modified ANTLR would then generate code like this:
     // Classify the next input character by its Unicode general category.
     int type = Character.getType(LA(1));
     switch (type) {
         case Character.END_PUNCTUATION:
             mRULE(true);
             theRetToken = _returnToken;
             break;
         ....
     }
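
To try the dispatch idea outside of ANTLR, the same switch can be put into a small standalone scanner. This is only a sketch under assumptions: the class CategoryLexerSketch, its LA() lookahead, the matchOne() helper, and the token names OPEN, CLOSE, DIGIT and OTHER are all hypothetical stand-ins, not ANTLR's generated code or API.

```java
import java.util.ArrayList;
import java.util.List;

public class CategoryLexerSketch {
    private final String input;
    private int pos = 0;

    public CategoryLexerSketch(String input) {
        this.input = input;
    }

    // Lookahead in the spirit of a lexer's LA(i); returns '\uFFFF' past the end.
    private char LA(int i) {
        int idx = pos + i - 1;
        return idx < input.length() ? input.charAt(idx) : '\uFFFF';
    }

    // Hypothetical stand-in for a generated rule method: consumes one
    // character and returns a "KIND:text" description of the token.
    private String matchOne(String kind) {
        return kind + ":" + input.charAt(pos++);
    }

    // Dispatches on the Unicode general category of the next character.
    public List<String> tokens() {
        List<String> out = new ArrayList<>();
        while (pos < input.length()) {
            switch (Character.getType(LA(1))) {
                case Character.START_PUNCTUATION:
                    out.add(matchOne("OPEN"));
                    break;
                case Character.END_PUNCTUATION:
                    out.add(matchOne("CLOSE"));
                    break;
                case Character.DECIMAL_DIGIT_NUMBER:
                    out.add(matchOne("DIGIT"));
                    break;
                default:
                    out.add(matchOne("OTHER"));
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(new CategoryLexerSketch("(1)").tokens());
    }
}
```

Because the switch keys on the category rather than on individual characters, any character that Character.getType() places in the same category takes the same branch.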

What do you think?

- Brian
