[antlr-interest] Re: Problems with Unicode support in ANTLR
Terence Parr
parrt at jguru.com
Thu May 16 23:13:54 PDT 2002
On Thursday, May 16, 2002, at 08:29 PM, Matthew Ford wrote:
> This approach would not work for me as I need
>
> IDENT
> options {testLiterals=true;
> paraphrase = "an identifier";}
> : ('a'..'z'|'_'|'$'|'\u0080'..'\uFFFE')
> ('a'..'z'|'_'|'0'..'9'|'$'|'\u0080'..'\uFFFE')*
> ;
>
> So rather than sub-blocks, what I need is an efficient compression
> method to store these bitsets in ANTLR.
Hi Matthew,
What does your IDENT example result in? I.e., what does ANTLR
generate? Something huge? It should actually do range analysis on
straight ranges, but will use a bit set for everything else.
Note that \u0080..\uFFFE ain't sparse so a sparse bitset won't help...we
need one that does ranges and sparseness :) I guess that is what you're
saying :)
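A set that "does ranges and sparseness" usually means an interval set:
store sorted, disjoint [lo, hi] ranges instead of one bit per code
point, so '\u0080'..'\uFFFE' costs two chars rather than kilobytes of
bits. A minimal sketch of the idea (the class name and API here are
hypothetical, not ANTLR's actual bitset code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interval-based character set: membership is a binary
// search over sorted, disjoint ranges instead of a bit-per-char table.
public class IntervalCharSet {
    private final List<char[]> ranges = new ArrayList<>(); // each entry: {lo, hi}

    // Assumes the caller adds ranges in sorted order, non-overlapping.
    public void add(char lo, char hi) {
        ranges.add(new char[] { lo, hi });
    }

    public boolean contains(char c) {
        int low = 0, high = ranges.size() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            char[] r = ranges.get(mid);
            if (c < r[0]) high = mid - 1;
            else if (c > r[1]) low = mid + 1;
            else return true; // r[0] <= c <= r[1]
        }
        return false;
    }

    public static void main(String[] args) {
        // The character class from the IDENT rule above.
        IntervalCharSet ident = new IntervalCharSet();
        ident.add('$', '$');
        ident.add('_', '_');
        ident.add('a', 'z');
        ident.add('\u0080', '\uFFFE');
        System.out.println(ident.contains('q'));      // true
        System.out.println(ident.contains('\u00E9')); // true
        System.out.println(ident.contains(' '));      // false
    }
}
```

Four ranges cover the whole IDENT class, however dense the \u0080..\uFFFE
block is, which is exactly where a sparse bitset gives up.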
Shouldn't be that hard to insert. The question is: has the use of
unicode made ANTLR go really slow during analysis (it shouldn't, given
that unicode ranges are limited to a few IDENT-like rules), or does it
generate massive files (in 2.7.2aX, that is, as I've done lots of work
on the bitsets)?
Also, the predefines are good for people who want to say "allow the
German character set" for the charVocabulary.
There is also a way to do standard character class compression like
people use in lex and so on for NFA->DFA conversion. I'm guessing
though that large UNICODE *range* use is limited to charVocabulary and a
few rules like IDENT. Also, people writing languages that use
punctuation from the Japanese character set, for example, might have
UNICODE *chars* sprinkled all over the grammar...that is ok when they
are treated as individual chars, thankfully; turn it into a set,
however, and boom! 8k ;)
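The "8k" figure falls straight out of the plain-bitset representation:
one bit per code point across the full 16-bit Unicode range. A quick
check of the arithmetic (just the cost calculation, not ANTLR's actual
bitset class):

```java
public class BitSetCost {
    public static void main(String[] args) {
        int codePoints = 0x10000;        // 65536 chars in the 16-bit range
        int bitsPerByte = 8;             // one bit of the set per char
        int bytes = codePoints / bitsPerByte;
        System.out.println(bytes);       // 8192 bytes -- the "8k" above
    }
}
```

So every single-char reference that gets promoted to a full set pays
that 8KB, which is why individual chars must stay individual chars.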
Thanks,
Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org