[antlr-interest] Re: Problems with Unicode support in ANTLR

Thu May 16 19:41:34 PDT 2002

--- In antlr-interest at y..., Terence Parr <parrt at j...> wrote:
> > Erm....Terence are you there?  ;-)
> 
> Who me? ;)  I've been waiting to see what direction people would point
> me. ;)  I've just looked at the source for Character.getType() and all
> those wacky mysterious tables at the bottom of the Character.java
> source.

And now you've got me doing the same ;-)

> The only problem is that we'll have to convert LOWERCASE_LETTER to a
> straight bitset that maps char -> yes/no if it's in that set.  So we
> could test LA(1), the next char of lookahead, against it.  Also,
> don't forget that ANTLR needs to do the grammar analysis so that it can
> determine if you have a nondeterminism.  For example, I presume the
> following would be ambiguous/nondeterministic:
> 
> DUH : LOWERCASE_LETTER | BASIC_LATIN | BENGALI ;

Perhaps there should be a restriction on combining predefined
UnicodeBlocks and UnicodeCategories in this manner, since they rarely
make sense together?
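
To make the overlap concrete, here is a rough sketch (plain JDK, nothing
from ANTLR) that builds one bitset per category and per block and checks
whether they intersect. The class and method names (UnicodeSets,
forCategory, forBlock) are made up for illustration; only
Character.getType(), Character.UnicodeBlock.of() and java.util.BitSet are
real JDK calls, and BitSet.intersects() assumes JDK 1.4 or later.

```java
import java.util.BitSet;

// Hypothetical helper, not part of ANTLR: builds character sets from the
// JDK's own Unicode tables.
final class UnicodeSets {
    static final int CHAR_COUNT = 0x10000; // 16-bit char range: 65536 values

    /** Set of every char whose Character.getType() equals the given category. */
    static BitSet forCategory(int category) {
        BitSet set = new BitSet(CHAR_COUNT);
        for (int c = 0; c < CHAR_COUNT; c++) {
            if (Character.getType((char) c) == category) {
                set.set(c);
            }
        }
        return set;
    }

    /** Set of every char that the JDK assigns to the given Unicode block. */
    static BitSet forBlock(Character.UnicodeBlock block) {
        BitSet set = new BitSet(CHAR_COUNT);
        for (int c = 0; c < CHAR_COUNT; c++) {
            // UnicodeBlock.of() returns null for chars outside any block.
            if (block.equals(Character.UnicodeBlock.of((char) c))) {
                set.set(c);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet lower = forCategory(Character.LOWERCASE_LETTER);
        BitSet latin = forBlock(Character.UnicodeBlock.BASIC_LATIN);
        // 'a'..'z' sit in both sets, so alternatives built from them overlap.
        System.out.println("LOWERCASE_LETTER overlaps BASIC_LATIN: "
                + lower.intersects(latin));
        // Membership test of the kind one would run against LA(1).
        System.out.println("'q' in LOWERCASE_LETTER: " + lower.get('q'));
    }
}
```

Since 'a'..'z' belong to both LOWERCASE_LETTER and BASIC_LATIN, a rule
that lists the two as alternatives has a real overlap for the analysis to
report.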

I see UnicodeBlock as being more useful in setting the CharVocabulary
option, although that may just be because I'm relatively unfamiliar
with them. In fact I think that is what it is.....ignore me.

> I believe I could predefine all these character categories and then
> simply let you refer to them.  Note that every set you reference,
> though, would be a potentially very big uncompressed bitset.  Worst case,
> if you have digit \uFFFE defined, you'd have about 65 kilobits in the
> set or about 8k per bitset.  Every time I have to combine this set with
> a character or another set, that's another 8k worst case.  I could
> probably make a simple compression that ignored long runs of zeros on
> the front of the bit set.

Perhaps BitSet's implementation could be separated from its interface,
so one could create a SparseRangeBitSet implementation that simply
stores a list of the ranges of "YES" characters (or "NO" characters,
whichever is more compact), perhaps as two int arrays, and only
generates a "traditional" BitSet if it is essential to do so.
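
Something along these lines is what I have in mind. It is only a sketch:
SparseRangeBitSet and its member()/toBitSet() methods are names I am
making up here, not existing ANTLR classes, and it assumes the ranges are
supplied sorted and non-overlapping. BitSet.set(from, to) and
cardinality() assume JDK 1.4 or later.

```java
import java.util.BitSet;

// Hypothetical class: stores inclusive [start, end] ranges of "YES" chars
// in two parallel int arrays instead of one bit per char.
final class SparseRangeBitSet {
    private final int[] starts; // inclusive range starts, ascending
    private final int[] ends;   // inclusive range ends, parallel to starts

    /** Assumes the ranges are sorted and do not overlap. */
    SparseRangeBitSet(int[] starts, int[] ends) {
        if (starts.length != ends.length) {
            throw new IllegalArgumentException("arrays must be parallel");
        }
        this.starts = starts.clone();
        this.ends = ends.clone();
    }

    /** Binary search for a range that contains c. */
    boolean member(int c) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (c < starts[mid]) {
                hi = mid - 1;
            } else if (c > ends[mid]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    /** Expands to a dense BitSet only when a caller really needs one. */
    BitSet toBitSet() {
        BitSet bits = new BitSet();
        for (int i = 0; i < starts.length; i++) {
            bits.set(starts[i], ends[i] + 1); // set(fromInclusive, toExclusive)
        }
        return bits;
    }

    public static void main(String[] args) {
        // '0'..'9' plus the single char \uFFFE from the worst-case example.
        SparseRangeBitSet digits = new SparseRangeBitSet(
                new int[] {'0', 0xFFFE}, new int[] {'9', 0xFFFE});
        System.out.println(digits.member('5'));
        System.out.println(digits.member('a'));
        System.out.println(digits.toBitSet().cardinality());
    }
}
```

With the \uFFFE example the sparse form holds four ints, while the dense
bitset it expands to still needs roughly 65 kilobits, so the expansion is
best left until a set operation actually requires it.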

> Is this the kind of thing you would need?  I.e., to be able to
> specifically refer to predefined Java character blocks as predefined
> ANTLR "characters"?

Yes, that would be a very good start for better Unicode support in ANTLR.

Cheers,

Micheal

PS     Would you be kind enough to look at my original post, please?
       I would like your opinion on choosing between (b) and (c), and on
       the questions I raised about both [e.g. how to shut off the
       warnings on (b)].
