[antlr-interest] Re: Problems with Unicode support in ANTLR

Matthew Ford Matthew.Ford at forward.com.au
Fri May 17 00:21:48 PDT 2002


That Unicode code was generated with ANTLR 2.7.1. I haven't tried anything
later since this seems to work.
matthew
----- Original Message -----
From: "Terence Parr" <parrt at jguru.com>
To: <antlr-interest at yahoogroups.com>
Sent: Friday, May 17, 2002 4:13 PM
Subject: Re: [antlr-interest] Re: Problems with Unicode support in ANTLR


>
> On Thursday, May 16, 2002, at 08:29  PM, Matthew Ford wrote:
>
> > This approach would not work for me as I need
> >
> > IDENT
> >  options {testLiterals=true;
> >      paraphrase = "an identifier";}
> >  : ('a'..'z'|'_'|'$'|'\u0080'..'\uFFFE')
> >    ('a'..'z'|'_'|'0'..'9'|'$'|'\u0080'..'\uFFFE')*
> >  ;
> >
> > So rather than sub-blocks, what I need is an efficient compression
> > method to store these bitsets in ANTLR.
>
> Hi Matthew,
>
> What does your IDENT example result in?  I.e., what does ANTLR
> generate?  Something huge?  It should actually do range analysis on
> straight ranges, but will use a bitset for everything else.
>
> Note that \u0080..\uFFFE ain't sparse so a sparse bitset won't help...we
> need one that does ranges and sparseness :)  I guess that is what you're
> saying :)
>
> Shouldn't be that hard to insert.  The question is: has the use of
> Unicode made ANTLR go really slow during analysis (it shouldn't, given
> that Unicode ranges are limited to a few IDENT-like rules), or does it
> generate massive files (with 2.7.2aX, that is, as I've done lots of
> work on the bitsets)?
>
> Also, the predefines are good for people that want to say "allow German
> character set" for the charVocabulary.
>
> There is also a way to do standard character class compression like
> people use in lex and so on for NFA->DFA conversion.  I'm guessing
> though that large UNICODE *range* use is limited to charVocabulary and a
> few rules like IDENT.  Also, people writing languages that use
> punctuation from the Japanese character set, for example, might have
> UNICODE *chars* sprinkled all over the grammar...that is ok when they
> are treated as individual chars, thankfully; turn it into a set,
> however, and boom! 8k ;)
>
> Thanks,
> Ter
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
>
>
>
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
>
>
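
To make the "ranges and sparseness" idea from the quoted message concrete,
here is a minimal Java sketch of a character set stored as sorted inclusive
ranges instead of one bit per character. This is not ANTLR's bitset code:
the class name CharRangeSet and its methods are hypothetical, made up for
this illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical range-list character set; NOT ANTLR's own bitset class. */
final class CharRangeSet {
    // Sorted, non-overlapping, non-adjacent inclusive ranges, each {lo, hi}.
    private final List<int[]> ranges = new ArrayList<>();

    /** Adds lo..hi (inclusive), merging with overlapping or adjacent ranges. */
    CharRangeSet add(int lo, int hi) {
        if (lo > hi) throw new IllegalArgumentException("lo > hi");
        List<int[]> merged = new ArrayList<>();
        boolean placed = false;
        for (int[] r : ranges) {
            if (r[1] + 1 < lo) {
                merged.add(r);                 // entirely before the new range
            } else if (hi + 1 < r[0]) {
                if (!placed) {                 // new range goes before r
                    merged.add(new int[] {lo, hi});
                    placed = true;
                }
                merged.add(r);                 // entirely after the new range
            } else {
                lo = Math.min(lo, r[0]);       // overlapping or adjacent: absorb
                hi = Math.max(hi, r[1]);
            }
        }
        if (!placed) merged.add(new int[] {lo, hi});
        ranges.clear();
        ranges.addAll(merged);
        return this;
    }

    /** Binary search over the sorted ranges. */
    boolean contains(int c) {
        int low = 0, high = ranges.size() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int[] r = ranges.get(mid);
            if (c < r[0]) high = mid - 1;
            else if (c > r[1]) low = mid + 1;
            else return true;
        }
        return false;
    }

    /** Number of stored ranges; a rough measure of the set's size. */
    int rangeCount() { return ranges.size(); }

    /** The first-character set of the IDENT rule quoted above. */
    static CharRangeSet identStart() {
        return new CharRangeSet()
            .add('a', 'z').add('_', '_').add('$', '$')
            .add(0x0080, 0xFFFE);
    }
}
```

Under these assumptions the IDENT start set needs only four ranges, whereas
a flat bitmap covering all 65536 16-bit character values takes 8192 bytes,
which is the "8k" mentioned above. Lookup costs a binary search rather than
a single bit test, so whether the trade is worthwhile would depend on how
such sets are used in the generated lexer.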


 



