[antlr-interest] Re: Problems with Unicode support in ANTLR

Terence Parr parrt at jguru.com
Thu May 16 23:13:54 PDT 2002


On Thursday, May 16, 2002, at 08:29  PM, Matthew Ford wrote:

> This approach would not work for me as I need
>
> IDENT
>  options {testLiterals=true;
>      paraphrase = "an identifier";}
>  : ('a'..'z'|'_'|'$'|'\u0080'..'\uFFFE')
> ('a'..'z'|'_'|'0'..'9'|'$'|'\u0080'..'\uFFFE')*
>  ;
>
> So rather than sub-blocks, what I need is an efficient compression
> method to store these bitsets in ANTLR.

Hi Matthew,

What does your IDENT example result in?  I.e., what does ANTLR 
generate?  Something huge?  It should actually do range analysis on 
straight ranges, but it will build a bit set for everything else.
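To give a feel for the difference (just a sketch in Java, not ANTLR's 
actual generated output; the class and method names here are made up), a 
straight range compiles down to a pair of comparisons, while an irregular 
class ends up as a bitset membership lookup:

    import java.util.BitSet;

    // Illustration only -- not code that ANTLR emits.
    public class CharTestSketch {
        // an irregular class like 'a'..'z'|'_'|'$' needs a membership table
        static final BitSet IDENT_START = new BitSet(0x10000);
        static {
            for (char c = 'a'; c <= 'z'; c++) IDENT_START.set(c);
            IDENT_START.set('_');
            IDENT_START.set('$');
        }

        // straight range: two comparisons, no table at all
        static boolean inHighRange(char c) {
            return c >= '\u0080' && c <= '\uFFFE';
        }

        // general set: bitset lookup; over all of Unicode that is a
        // 65536-bit (8KB) table per set
        static boolean isIdentStart(char c) {
            return IDENT_START.get(c);
        }
    }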

Note that \u0080..\uFFFE ain't sparse, so a sparse bitset won't help...we 
need one that handles both ranges and sparseness :)  I guess that is what 
you're saying :)
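Here's roughly what I mean by a set that does ranges and sparseness: keep 
sorted, disjoint [lo,hi] intervals and binary-search them, so a span like 
\u0080..\uFFFE costs two chars instead of ~8KB of bits.  Again, just a 
Java sketch of the idea, not anything that's in ANTLR:

    // Sketch: a character set stored as sorted, non-overlapping intervals.
    public class IntervalCharSet {
        private final char[] lo;
        private final char[] hi;

        // intervals must be sorted and disjoint, e.g. lo = {'a', '\u0080'},
        // hi = {'z', '\uFFFE'} for 'a'..'z' | '\u0080'..'\uFFFE'
        public IntervalCharSet(char[] lo, char[] hi) {
            this.lo = lo;
            this.hi = hi;
        }

        // binary search over the intervals: O(log n) in the number of ranges,
        // and storage proportional to the number of ranges, not their width
        public boolean member(char c) {
            int left = 0, right = lo.length - 1;
            while (left <= right) {
                int mid = (left + right) >>> 1;
                if (c < lo[mid])      right = mid - 1;
                else if (c > hi[mid]) left = mid + 1;
                else                  return true;
            }
            return false;
        }
    }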

Shouldn't be that hard to insert.  The question is: has the use of 
Unicode made ANTLR go really slow during analysis (it shouldn't, given 
that Unicode ranges are limited to a few IDENT-like rules), or does it 
generate massive files (in 2.7.2aX, that is, since I've done lots of work 
on the bitsets there)?

Also, the predefined character sets are good for people who want to say 
"allow the German character set" for the charVocabulary.

There is also a way to do standard character-class compression, like 
people use in lex and so on for NFA->DFA conversion.  I'm guessing, 
though, that large Unicode *range* use is limited to charVocabulary and a 
few rules like IDENT.  Also, people writing languages that use 
punctuation from the Japanese character set, for example, might have 
Unicode *chars* sprinkled all over the grammar...that is ok when they 
are treated as individual chars, thankfully; turn one into a set, 
however, and boom! 8k ;)
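For the curious, that lex-style compression looks roughly like this: 
characters that no set in the grammar distinguishes collapse into one 
equivalence class, and transition tables are indexed by the class number 
instead of the raw 16-bit char.  A Java sketch under those assumptions 
(again, not anything in ANTLR today; the helper and its parameter are 
hypothetical):

    // Sketch: build a char -> equivalence-class map from the boundary
    // points of every range used in the grammar (each range's start and
    // one past its end, sorted and de-duplicated).
    public class CharClassCompression {
        public static char[] buildClassMap(int[] sortedBoundaries) {
            char[] classOf = new char[0x10000];
            char cls = 0;
            int b = 0;
            for (int c = 0; c < 0x10000; c++) {
                // each boundary starts a new class; chars between the same
                // two boundaries behave identically and share a class id
                while (b < sortedBoundaries.length && c == sortedBoundaries[b]) {
                    cls++;
                    b++;
                }
                classOf[c] = cls;
            }
            return classOf;
        }
    }

With only a handful of ranges in a grammar, the number of classes stays 
tiny, so the per-state tables shrink from 64K entries down to a few dozen.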

Thanks,
Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org


 
