[antlr-interest] Re: Problems with Unicode support in ANTLR
micheal_jor
open.zone at virgin.net
Thu May 16 19:41:34 PDT 2002
--- In antlr-interest at y..., Terence Parr <parrt at j...> wrote:
> > Erm....Terence are you there? ;-)
>
> Who me? ;) I've been waiting to see what direction people would point
> me. ;) I've just looked at the source for Character.getType() and all
> those wacky mysterious tables at the bottom of the Character.java source.
And now you've got me doing the same ;-)
> The only problem is that we'll have to convert LOWERCASE_LETTER to a
> straight bitset that maps char -> yes/no if it's in that set. So we
> could test LA(1), the next char of lookahead, against it. Also,
> don't forget that ANTLR needs to do the grammar analysis so that it can
> determine if you have a nondeterminism. For example, I presume the
> following would be ambiguous/nondeterministic:
>
> DUH : LOWERCASE_LETTER | BASIC_LATIN | BENGALI ;
Perhaps a restriction on combining predefined UnicodeBlocks and
UnicodeCategories in this manner, since they rarely make sense
together?
I see UnicodeBlock as being more useful for setting the CharVocabulary
option, although that may just be because I'm relatively unfamiliar
with them. In fact I think that is what it is.....ignore me.
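
To make the ambiguity concrete, here is a minimal sketch (not ANTLR code) of turning a category or block into a char -> yes/no bitset and checking whether two such sets overlap. It assumes plain java.util.BitSet and 16-bit chars only; the class and method names (CategorySets, forCategory, forBlock) are made up for illustration.

```java
import java.util.BitSet;

final class CategorySets {
    /** Bitset of all 16-bit chars whose Character.getType() equals the given category. */
    static BitSet forCategory(int category) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.getType((char) c) == category) {
                set.set(c);
            }
        }
        return set;
    }

    /** Bitset of all 16-bit chars that belong to the given Unicode block. */
    static BitSet forBlock(Character.UnicodeBlock block) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            // UnicodeBlock.of() returns null for chars outside any block.
            if (Character.UnicodeBlock.of((char) c) == block) {
                set.set(c);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet lower = forCategory(Character.LOWERCASE_LETTER);
        BitSet basicLatin = forBlock(Character.UnicodeBlock.BASIC_LATIN);
        BitSet bengali = forBlock(Character.UnicodeBlock.BENGALI);

        char la1 = 'q'; // stand-in for LA(1), the next char of lookahead
        System.out.println("la1 is lowercase: " + lower.get(la1));

        // Two alternatives whose sets share a char cannot be told apart
        // with a single char of lookahead.
        System.out.println("LOWERCASE_LETTER overlaps BASIC_LATIN: " + lower.intersects(basicLatin));
        System.out.println("LOWERCASE_LETTER overlaps BENGALI: " + lower.intersects(bengali));
    }
}
```

'a' through 'z' are both lowercase letters and Basic Latin, so the first two alternatives of DUH overlap, which is the nondeterminism the grammar analysis would have to report.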
> I believe I could predefine all these character categories and then
> simply let you refer to them. Note that every set you reference,
> though, would be a potentially very big uncompressed bitset. Worst case,
> if you have digit \uFFFE defined, you'd have about 65 kilobits in the
> set or about 8k per bitset. Every time I have to combine this set with
> a character or another set, that's another 8k worst case. I could
> probably make a simple compression that ignored long runs of zeros on
> the front of the bit set.
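
The arithmetic is 65536 bits / 8 = 8192 bytes for a set that reaches the top of the 16-bit range. One possible reading of "ignore long runs of zeros on the front" is sketched below, assuming java.util.BitSet rather than ANTLR's own bitset class; OffsetBitSet is a hypothetical name, not something that exists in ANTLR.

```java
import java.util.BitSet;

/** Hypothetical sketch: a bitset stored relative to its first set bit. */
final class OffsetBitSet {
    private final int offset;  // index of the first set bit in the original set
    private final BitSet bits; // the original bits, shifted down by offset

    OffsetBitSet(BitSet dense) {
        int first = dense.nextSetBit(0);
        this.offset = first < 0 ? 0 : first;
        // get(from, to) copies bits from..to-1 into a new BitSet starting at index 0.
        this.bits = dense.get(offset, Math.max(dense.length(), offset));
    }

    boolean member(int c) {
        return c >= offset && bits.get(c - offset);
    }

    /** Number of bit positions that actually need storing after trimming. */
    int storedBits() {
        return bits.length();
    }

    public static void main(String[] args) {
        BitSet dense = new BitSet();
        dense.set(0xFFFE); // the worst case from the message above
        OffsetBitSet trimmed = new OffsetBitSet(dense);
        System.out.println("dense bits:   " + dense.length());
        System.out.println("trimmed bits: " + trimmed.storedBits());
    }
}
```

This only helps when the set bits are clustered near the top; a set containing both 'a' and \uFFFE would still need the full span.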
Perhaps a separation of BitSet's implementation from its interface,
so one could create a SparseRangeBitSet implementation that simply
stores a list of the ranges of "YES" characters (or "NO", depending on
which is more compact), perhaps as two int arrays, and only
generates a "traditional" BitSet if it is essential to do so.
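
A rough sketch of what I mean, assuming the ranges are kept sorted and non-overlapping in two parallel int arrays. SparseRangeBitSet does not exist anywhere yet; the method names here (member, toBitSet, fromBitSet, rangeCount) are my own guesses, and java.util.BitSet stands in for ANTLR's bitset.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch: membership stored as sorted, non-overlapping [start, end] ranges. */
final class SparseRangeBitSet {
    private final int[] starts; // inclusive lower bounds, ascending
    private final int[] ends;   // inclusive upper bounds, parallel to starts

    /** Assumes the caller passes sorted, non-overlapping ranges. */
    SparseRangeBitSet(int[] starts, int[] ends) {
        if (starts.length != ends.length) {
            throw new IllegalArgumentException("starts and ends must have the same length");
        }
        this.starts = starts.clone();
        this.ends = ends.clone();
    }

    /** Builds the range list by scanning a dense BitSet for runs of set bits. */
    static SparseRangeBitSet fromBitSet(BitSet dense) {
        int[] s = new int[16];
        int[] e = new int[16];
        int n = 0;
        int from = dense.nextSetBit(0);
        while (from >= 0) {
            int to = dense.nextClearBit(from); // first clear bit after this run
            if (n == s.length) {
                s = Arrays.copyOf(s, n * 2);
                e = Arrays.copyOf(e, n * 2);
            }
            s[n] = from;
            e[n] = to - 1;
            n++;
            from = dense.nextSetBit(to);
        }
        return new SparseRangeBitSet(Arrays.copyOf(s, n), Arrays.copyOf(e, n));
    }

    /** Binary search over the range starts, then one bounds check. */
    boolean member(int c) {
        int i = Arrays.binarySearch(starts, c);
        if (i >= 0) {
            return true; // c is exactly the start of a range
        }
        int idx = -i - 2; // last range whose start is below c, or -1 if none
        return idx >= 0 && c <= ends[idx];
    }

    /** Materializes a "traditional" BitSet only when one is really needed. */
    BitSet toBitSet() {
        BitSet dense = new BitSet();
        for (int i = 0; i < starts.length; i++) {
            dense.set(starts[i], ends[i] + 1); // set(from, to) excludes 'to'
        }
        return dense;
    }

    int rangeCount() {
        return starts.length;
    }
}
```

For example, new SparseRangeBitSet(new int[] {'0', 'a'}, new int[] {'9', 'z'}) describes digits plus lowercase ASCII letters with four ints. Whether the analysis code could work on ranges directly, without calling toBitSet(), is the open question.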
> Is this the kind of thing you would need? I.e., to be able to
> specifically refer to predefined Java character blocks as predefined
> ANTLR "characters"?
Yes, that would be a very good start for better Unicode support in ANTLR.
Cheers,
Micheal
PS Would you be kind enough to look at my original post please?
I would like your opinion on choosing between (b) and (c), and on
the questions I raised about both [e.g. how to shut off the warnings
on (b)].