[antlr-interest] Re: Problems with Unicode support in ANTLR

Thu May 16 19:54:33 PDT 2002

Perhaps I am missing the thrust of this exchange, but my requirement for
Unicode in ANTLR is to support an English command set together with
foreign-language (initially Japanese) variable names.
This talk of character blocks seems too restrictive: I need full Unicode
support to cover the possible variable names.
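
As a rough illustration of what "full Unicode" variable names could mean in practice, here is a minimal sketch. It assumes the JDK's own definition of a Unicode identifier (Character.isUnicodeIdentifierStart / isUnicodeIdentifierPart) is an acceptable rule; the class and method names are hypothetical and nothing here is ANTLR code.

```java
// Hypothetical helper, not part of ANTLR: checks whether a string is a
// Unicode identifier using the JDK's own character classification.
final class UnicodeIdentifierSketch {
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!Character.isUnicodeIdentifierStart(first)) return false;
        // Walk the rest of the string one code point at a time.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isUnicodeIdentifierPart(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // Two kanji (U+5909 U+6570), the Japanese word for "variable".
        System.out.println(isIdentifier("\u5909\u6570"));
        System.out.println(isIdentifier("count1"));
        System.out.println(isIdentifier("1count")); // starts with a digit
    }
}
```

A lexer rule would need the same start/part distinction, whichever way the character sets end up being named.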
matthew
----- Original Message -----
From: "micheal_jor" <open.zone at virgin.net>
To: <antlr-interest at yahoogroups.com>
Sent: Friday, May 17, 2002 12:41 PM
Subject: [antlr-interest] Re: Problems with Unicode support in ANTLR

> --- In antlr-interest at y..., Terence Parr <parrt at j...> wrote:
> > > Erm....Terence are you there?  ;-)
> >
> > Who me? ;)  I've been waiting to see what direction people would point
> > me. ;)  I've just looked at the source for Character.getType() and all
> > those wacky mysterious tables at the bottom of the Character.java source.
>
> And now you've got me doing the same ;-)
>
> > The only problem is that we'll have to convert LOWERCASE_LETTER to a
> > straight bitset that maps char -> yes/no if it's in that set.  So we
> > could test LA(1), the next char of lookahead, against it.  Also,
> > don't forget that ANTLR needs to do the grammar analysis so that it can
> > determine if you have a nondeterminism.  For example, I presume the
> > following would be ambiguous/nondeterministic:
> >
> > DUH : LOWERCASE_LETTER | BASIC_LATIN | BENGALI ;
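
The overlap described here can be checked mechanically once each named set is a bitset. A minimal sketch, assuming the sets are built from the JDK's own Character.getType() and Character.UnicodeBlock tables and restricted to 16-bit chars; the class and method names are hypothetical, and this is not how ANTLR's grammar analysis is actually implemented.

```java
import java.util.BitSet;

// Hypothetical sketch, not ANTLR's implementation: builds char -> yes/no sets
// from the JDK tables and checks whether two alternatives overlap.
final class CharClassOverlapSketch {
    // One bit per 16-bit char value whose general category equals `category`.
    static BitSet categorySet(int category) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.getType((char) c) == category) set.set(c);
        }
        return set;
    }

    // One bit per 16-bit char value that belongs to the given Unicode block.
    static BitSet blockSet(Character.UnicodeBlock block) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.UnicodeBlock.of((char) c) == block) set.set(c);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet lower = categorySet(Character.LOWERCASE_LETTER);
        BitSet latin = blockSet(Character.UnicodeBlock.BASIC_LATIN);
        BitSet bengali = blockSet(Character.UnicodeBlock.BENGALI);
        System.out.println(lower.intersects(latin));   // 'a'..'z' are in both sets
        System.out.println(lower.intersects(bengali));
        System.out.println(lower.get('a'));            // LA(1)-style membership test
    }
}
```

If two alternatives of a rule intersect, one character of lookahead cannot choose between them, which is the nondeterminism in question.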
>
> Perhaps a restriction on combining pre-defined UnicodeBlocks and
> UnicodeCategories in this manner since they rarely make sense
> together?
>
> I see UnicodeBlock as being more useful in setting the CharVocabulary
> option, although that may just be because I'm relatively unfamiliar
> with them. In fact I think that is what it is.....ignore me.
>
> > I believe I could predefine all these character categories and then
> > simply let you refer to them.  Note, though, that every set you reference
> > would be a potentially very big uncompressed bitset.  Worst case,
> > if you have digit \uFFFE defined, you'd have about 65 kilobits in the
> > set, or about 8k per bitset.  Every time I have to combine this set with
> > a character or another set, that's another 8k worst case.  I could
> > probably make a simple compression that ignored long runs of zeros on
> > the front of the bit set.
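
To make the arithmetic concrete: 65536 char values at one bit each is 8192 bytes, the "8k" above. The "ignore the zeros on the front" idea might look like the following sketch, which assumes the lowest and highest members are known when the set is created; the class is hypothetical and only illustrates the storage saving, not ANTLR's BitSet.

```java
// Hypothetical sketch: skip the run of zeros at the front by remembering the
// index of the first 64-bit word kept and storing only the words from there on.
final class OffsetBitSetSketch {
    private final int firstWord;   // index of the first 64-bit word kept
    private final long[] words;    // words covering lowestChar..highestChar

    OffsetBitSetSketch(int lowestChar, int highestChar) {
        firstWord = lowestChar >>> 6;
        words = new long[(highestChar >>> 6) - firstWord + 1];
    }

    // c must lie between the bounds given to the constructor.
    void add(int c) {
        words[(c >>> 6) - firstWord] |= 1L << (c & 63);
    }

    boolean member(int c) {
        int w = (c >>> 6) - firstWord;
        return w >= 0 && w < words.length && (words[w] & (1L << (c & 63))) != 0;
    }

    int storageBytes() {
        return words.length * 8;
    }

    public static void main(String[] args) {
        // Uncompressed worst case: one bit for each of the 65536 char values.
        System.out.println(65536 / 8);   // 8192 bytes
        // Fullwidth digits U+FF10..U+FF19 sit near the top of the 16-bit range.
        OffsetBitSetSketch digits = new OffsetBitSetSketch(0xFF10, 0xFF19);
        for (int c = 0xFF10; c <= 0xFF19; c++) digits.add(c);
        System.out.println(digits.member(0xFF15));
        System.out.println(digits.storageBytes());
    }
}
```

This only helps sets clustered at the high end; a set that also contains ASCII characters would still span nearly the whole range.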
>
> Perhaps a separation of BitSet's implementation from its interface,
> so one could create a SparseRangeBitSet implementation that simply
> stores a list of the ranges of "YES" characters (or "NO", depending on
> which is more compact), perhaps as two int arrays, and only
> generates a "traditional" BitSet if it is essential to do so.
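
The range-list idea could be sketched roughly as below. The interface and class names are illustrative only (nothing like them is claimed to exist in ANTLR), and the sketch assumes the ranges are supplied sorted and non-overlapping.

```java
import java.util.BitSet;

// Hypothetical interface separating "is this char in the set?" from storage.
interface CharSetSketch {
    boolean member(int c);
}

// Hypothetical range-list implementation backed by two parallel int arrays.
final class SparseRangeBitSetSketch implements CharSetSketch {
    private final int[] starts;  // inclusive range starts, sorted ascending
    private final int[] ends;    // inclusive range ends, ranges do not overlap

    SparseRangeBitSetSketch(int[] starts, int[] ends) {
        this.starts = starts.clone();
        this.ends = ends.clone();
    }

    // Binary search for a range that contains c.
    public boolean member(int c) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (c < starts[mid]) hi = mid - 1;
            else if (c > ends[mid]) lo = mid + 1;
            else return true;
        }
        return false;
    }

    // Expand to a dense java.util.BitSet only when a caller really needs one.
    BitSet toBitSet() {
        BitSet dense = new BitSet();
        for (int i = 0; i < starts.length; i++) {
            dense.set(starts[i], ends[i] + 1);  // upper bound is exclusive
        }
        return dense;
    }

    public static void main(String[] args) {
        // ASCII lowercase plus hiragana letters U+3041..U+3096.
        SparseRangeBitSetSketch s = new SparseRangeBitSetSketch(
                new int[] {'a', 0x3041}, new int[] {'z', 0x3096});
        System.out.println(s.member('q'));
        System.out.println(s.member(0x3042));  // HIRAGANA LETTER A
        System.out.println(s.member('Q'));
        System.out.println(s.toBitSet().cardinality());
    }
}
```

Two ranges cost four ints here, against roughly 1.5k for a dense bitset reaching U+3096.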
>
> > Is this the kind of thing you would need?  I.e., to be able to
> > specifically refer to predefined Java character blocks as predefined
> > ANTLR "characters"?
>
> Yes, that would be a very good start for better Unicode support in ANTLR.
>
> Cheers,
>
> Micheal
>
> PS     Would you be kind enough to look at my original post please?
>        I would like your opinion on choosing between (b) and (c), and on
>        the questions I raised about both [ e.g. how to shut off the warnings
>        on (b) ]
>
>
>
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/