[antlr-interest] Re: Problems with Unicode support in ANTLR
Matthew Ford
Matthew.Ford at forward.com.au
Fri May 17 00:14:41 PDT 2002
Hi Ter,
Here is the code that ANTLR generates for IDENT.
No big bitsets. (You fixed that.)
public final void mIDENT(boolean _createToken) throws RecognitionException,
        CharStreamException, TokenStreamException {
    int _ttype; Token _token=null; int _begin=text.length();
    _ttype = IDENT;
    int _saveIndex;
    {
    switch ( LA(1)) {
    case 'a': case 'b': case 'c': case 'd':
    case 'e': case 'f': case 'g': case 'h':
    case 'i': case 'j': case 'k': case 'l':
    case 'm': case 'n': case 'o': case 'p':
    case 'q': case 'r': case 's': case 't':
    case 'u': case 'v': case 'w': case 'x':
    case 'y': case 'z':
    {
        matchRange('a','z');
        break;
    }
    case '_':
    {
        match('_');
        break;
    }
    case '$':
    {
        match('$');
        break;
    }
    default:
        if (((LA(1) >= '\u0080' && LA(1) <= '\ufffe'))) {
            matchRange('\u0080','\uFFFE');
        }
        else {
            throw new NoViableAltForCharException((char)LA(1), getFilename(),
                    getLine());
        }
    }
    }
    {
    _loop8:
    do {
        switch ( LA(1)) {
        case 'a': case 'b': case 'c': case 'd':
        case 'e': case 'f': case 'g': case 'h':
        case 'i': case 'j': case 'k': case 'l':
        case 'm': case 'n': case 'o': case 'p':
        case 'q': case 'r': case 's': case 't':
        case 'u': case 'v': case 'w': case 'x':
        case 'y': case 'z':
        {
            matchRange('a','z');
            break;
        }
        case '_':
        {
            match('_');
            break;
        }
        case '0': case '1': case '2': case '3':
        case '4': case '5': case '6': case '7':
        case '8': case '9':
        {
            matchRange('0','9');
            break;
        }
        case '$':
        {
            match('$');
            break;
        }
        default:
            if (((LA(1) >= '\u0080' && LA(1) <= '\ufffe'))) {
                matchRange('\u0080','\uFFFE');
            }
            else {
                break _loop8;
            }
        }
    } while (true);
    }
    _ttype = testLiteralsTable(_ttype);
    if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
        _token = makeToken(_ttype);
        _token.setText(new String(text.getBuffer(), _begin,
                text.length()-_begin));
    }
    _returnToken = _token;
}
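
Stripped of the switch scaffolding, the generated rule makes two character tests: one for the first character and one for each following character. The sketch below restates them by hand. It is not ANTLR output, and the names IdentCheck, isIdentStart, isIdentPart and matchIdent are made up for illustration.

```java
final class IdentCheck {
    // First character of IDENT: 'a'..'z', '_', '$', or the range 0x0080..0xFFFE.
    static boolean isIdentStart(char c) {
        return (c >= 'a' && c <= 'z') || c == '_' || c == '$'
                || (c >= '\u0080' && c <= '\uFFFE');
    }

    // Following characters: everything allowed at the start, plus '0'..'9'.
    static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    // Length of the identifier starting at offset, or 0 if none starts there.
    static int matchIdent(CharSequence s, int offset) {
        if (offset >= s.length() || !isIdentStart(s.charAt(offset))) {
            return 0;
        }
        int i = offset + 1;
        while (i < s.length() && isIdentPart(s.charAt(i))) {
            i++;
        }
        return i - offset;
    }
}
```

The point is only that every alternative here is a plain range or a single character, so comparisons are enough and no bitset is required.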
----- Original Message -----
From: "Terence Parr" <parrt at jguru.com>
To: <antlr-interest at yahoogroups.com>
Sent: Friday, May 17, 2002 4:13 PM
Subject: Re: [antlr-interest] Re: Problems with Unicode support in ANTLR
>
> On Thursday, May 16, 2002, at 08:29 PM, Matthew Ford wrote:
>
> > This approach would not work for me as I need
> >
> > IDENT
> > options {testLiterals=true;
> > paraphrase = "an identifier";}
> > : ('a'..'z'|'_'|'$'|'\u0080'..'\uFFFE')
> > ('a'..'z'|'_'|'0'..'9'|'$'|'\u0080'..'\uFFFE')*
> > ;
> >
> > So rather than sub-blocks, what I need is an efficient compression
> > method to store these bitsets in ANTLR.
>
> Hi Matthew,
>
> What does your IDENT example result in? I.e., what does ANTLR
> generate? Something huge? It should actually do range analysis on
> straight ranges, but will use a bitset for everything else.
>
> Note that \u0080..\uFFFE ain't sparse so a sparse bitset won't help...we
> need one that does ranges and sparseness :) I guess that is what you're
> saying :)
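
To make "a set that does ranges and sparseness" concrete, here is a minimal sketch of one possible representation: a sorted list of inclusive ranges searched by binary search. This is an assumption about what such a set could look like, not ANTLR's bitset code, and CharRangeSet, member, rangeCount and identPart are hypothetical names. A plain bitset over 0x0000..0xFFFF needs 65536 bits, i.e. 8192 bytes, while the IDENT follow set below is described by five ranges.

```java
final class CharRangeSet {
    private final char[] lo; // inclusive lower bounds, ascending
    private final char[] hi; // inclusive upper bounds, parallel to lo

    // bounds = lo0, hi0, lo1, hi1, ...; ranges must be sorted and disjoint.
    CharRangeSet(char... bounds) {
        if (bounds.length % 2 != 0) {
            throw new IllegalArgumentException("bounds must come in lo/hi pairs");
        }
        int n = bounds.length / 2;
        lo = new char[n];
        hi = new char[n];
        for (int i = 0; i < n; i++) {
            lo[i] = bounds[2 * i];
            hi[i] = bounds[2 * i + 1];
            if (lo[i] > hi[i] || (i > 0 && lo[i] <= hi[i - 1])) {
                throw new IllegalArgumentException("ranges must be sorted and disjoint");
            }
        }
    }

    // Binary search over the ranges: O(log n) in the number of ranges.
    boolean member(char c) {
        int low = 0;
        int high = lo.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (c < lo[mid]) {
                high = mid - 1;
            } else if (c > hi[mid]) {
                low = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    int rangeCount() {
        return lo.length;
    }

    // The follow set of the IDENT rule quoted above, in ascending order.
    static CharRangeSet identPart() {
        return new CharRangeSet('$', '$', '0', '9', '_', '_', 'a', 'z',
                '\u0080', '\uFFFE');
    }
}
```

The trade-off is that a lookup costs a few comparisons instead of a single bit test, which may or may not matter for a lexer's inner loop.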
>
> Shouldn't be that hard to insert. The question is: has the use of
> Unicode made ANTLR go really slow during analysis (it shouldn't, given
> that Unicode ranges are limited to a few IDENT-like rules), or does it
> generate massive files (in 2.7.2aX, that is, as I've done lots of work
> on the bitsets)?
>
> Also, the predefines are good for people that want to say "allow German
> character set" for the charVocabulary.
>
> There is also a way to do standard character class compression like
> people use in lex and so on for NFA->DFA conversion. I'm guessing
> though that large UNICODE *range* use is limited to charVocabulary and a
> few rules like IDENT. Also, people writing languages that use
> punctuation from the Japanese character set, for example, might have
> UNICODE *chars* sprinkled all over the grammar...that is ok when they
> are treated as individual chars, thankfully; turn it into a set,
> however, and boom! 8k ;)
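
As a rough illustration of the character class compression idea mentioned above: characters that belong to exactly the same sets can share one class number, so later tables are indexed by class rather than by character. The sketch below is one simple way to compute such classes from range-based sets. It is not taken from lex or ANTLR, and CharClassCompressor, compress and classOf are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class CharClassCompressor {
    // sets[i] lists the inclusive {lo, hi} ranges of set i.
    // Returns rows {lo, hi, classId} covering 0..maxChar; two characters get
    // the same classId exactly when they are members of the same sets.
    static List<int[]> compress(int[][][] sets, int maxChar) {
        // Membership can only change at a range start or just past a range end.
        TreeSet<Integer> cuts = new TreeSet<>();
        cuts.add(0);
        for (int[][] set : sets) {
            for (int[] r : set) {
                cuts.add(r[0]);
                if (r[1] + 1 <= maxChar) {
                    cuts.add(r[1] + 1);
                }
            }
        }
        Integer[] starts = cuts.toArray(new Integer[0]);
        Map<String, Integer> classIds = new HashMap<>();
        List<int[]> table = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            int lo = starts[i];
            int hi = (i + 1 < starts.length) ? starts[i + 1] - 1 : maxChar;
            // Signature: one '0'/'1' per set, tested on the interval's first char.
            StringBuilder sig = new StringBuilder();
            for (int[][] set : sets) {
                sig.append(contains(set, lo) ? '1' : '0');
            }
            String key = sig.toString();
            Integer id = classIds.get(key);
            if (id == null) {
                id = classIds.size();
                classIds.put(key, id);
            }
            table.add(new int[] {lo, hi, id});
        }
        return table;
    }

    static boolean contains(int[][] set, int c) {
        for (int[] r : set) {
            if (c >= r[0] && c <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // Linear lookup of a character's class; -1 if c lies outside the table.
    static int classOf(List<int[]> table, int c) {
        for (int[] row : table) {
            if (c >= row[0] && c <= row[1]) {
                return row[2];
            }
        }
        return -1;
    }
}
```

Fed the start set and the follow set of the IDENT rule, this should collapse the 65536 character values into three classes: characters in neither set, characters in both, and the digits, which are in the follow set only.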
>
> Thanks,
> Ter
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
>
>
>
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/