[antlr-interest] Re: Problems with Unicode support in ANTLR

Fri May 17 00:14:41 PDT 2002

Hi Ter,

Here is the code that is generated for IDENT.
No big bitsets.  (You fixed that.)

 public final void mIDENT(boolean _createToken)
   throws RecognitionException, CharStreamException, TokenStreamException {
  int _ttype; Token _token=null; int _begin=text.length();
  _ttype = IDENT;
  int _saveIndex;

  {
  switch ( LA(1)) {
  case 'a':  case 'b':  case 'c':  case 'd':
  case 'e':  case 'f':  case 'g':  case 'h':
  case 'i':  case 'j':  case 'k':  case 'l':
  case 'm':  case 'n':  case 'o':  case 'p':
  case 'q':  case 'r':  case 's':  case 't':
  case 'u':  case 'v':  case 'w':  case 'x':
  case 'y':  case 'z':
  {
   matchRange('a','z');
   break;
  }
  case '_':
  {
   match('_');
   break;
  }
  case '$':
  {
   match('$');
   break;
  }
  default:
   if (((LA(1) >= '\u0080' && LA(1) <= '\ufffe'))) {
    matchRange('\u0080','\uFFFE');
   }
   else {
    throw new NoViableAltForCharException((char)LA(1), getFilename(), getLine());
   }
  }
  }
  {
  _loop8:
  do {
   switch ( LA(1)) {
   case 'a':  case 'b':  case 'c':  case 'd':
   case 'e':  case 'f':  case 'g':  case 'h':
   case 'i':  case 'j':  case 'k':  case 'l':
   case 'm':  case 'n':  case 'o':  case 'p':
   case 'q':  case 'r':  case 's':  case 't':
   case 'u':  case 'v':  case 'w':  case 'x':
   case 'y':  case 'z':
   {
    matchRange('a','z');
    break;
   }
   case '_':
   {
    match('_');
    break;
   }
   case '0':  case '1':  case '2':  case '3':
   case '4':  case '5':  case '6':  case '7':
   case '8':  case '9':
   {
    matchRange('0','9');
    break;
   }
   case '$':
   {
    match('$');
    break;
   }
   default:
    if (((LA(1) >= '\u0080' && LA(1) <= '\ufffe'))) {
     matchRange('\u0080','\uFFFE');
    }
    else {
     break _loop8;
    }
   }
  } while (true);
  }
  _ttype = testLiteralsTable(_ttype);
  if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
   _token = makeToken(_ttype);
   _token.setText(new String(text.getBuffer(), _begin, text.length()-_begin));
  }
  _returnToken = _token;
 }
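
For readers who want to check what the generated rule accepts without
running ANTLR, the same character tests can be written by hand. This is
only a sketch: IdentSketch and its method names are made up for
illustration and are not part of ANTLR, and it leaves out the
testLiterals lookup and token creation that mIDENT performs.

```java
// Hand-written sketch of the character tests behind the generated mIDENT.
// IdentSketch and its method names are hypothetical, not ANTLR API.
final class IdentSketch {
    private IdentSketch() {}

    // First character: 'a'..'z', '_', '$', or the range 0x0080..0xFFFE.
    static boolean isIdentStart(char c) {
        return (c >= 'a' && c <= 'z') || c == '_' || c == '$'
            || (c >= 0x0080 && c <= 0xFFFE);
    }

    // Following characters additionally allow '0'..'9'.
    static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    // Length of the IDENT starting at offset 'from', or 0 if none starts there.
    static int identLength(String s, int from) {
        if (from >= s.length() || !isIdentStart(s.charAt(from))) {
            return 0;
        }
        int i = from + 1;
        while (i < s.length() && isIdentPart(s.charAt(i))) {
            i++;
        }
        return i - from;
    }
}
```

Note that each alternative is a plain comparison or a two-sided range
test, which is why no bitset is needed for this rule.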

----- Original Message -----
From: "Terence Parr" <parrt at jguru.com>
To: <antlr-interest at yahoogroups.com>
Sent: Friday, May 17, 2002 4:13 PM
Subject: Re: [antlr-interest] Re: Problems with Unicode support in ANTLR

>
> On Thursday, May 16, 2002, at 08:29  PM, Matthew Ford wrote:
>
> > This approach would not work for me as I need
> >
> > IDENT
> >  options {testLiterals=true;
> >      paraphrase = "an identifier";}
> >  : ('a'..'z'|'_'|'$'|'\u0080'..'\uFFFE')
> > ('a'..'z'|'_'|'0'..'9'|'$'|'\u0080'..'\uFFFE')*
> >  ;
> >
> > So rather than sub-blocks, what I need is an efficient compression
> > method to store these bitsets in ANTLR.
>
> Hi Matthew,
>
> What does your IDENT example result in?  I.e., what does ANTLR
> generate?  Something huge?  It should actually do range analysis on
> straight ranges, but will generate a bit set for everything else.
>
> Note that \u0080..\uFFFE ain't sparse so a sparse bitset won't help...we
> need one that does ranges and sparseness :)  I guess that is what you're
> saying :)
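
To make the "ranges and sparseness" idea concrete, here is a minimal
sketch of a set stored as sorted, non-overlapping inclusive ranges and
searched by binary search. RangeCharSet is a hypothetical name, not an
ANTLR class, and the constructor assumes (without checking) that the
ranges arrive sorted and disjoint. For comparison it also computes the
size of a dense bit set over the same vocabulary.

```java
import java.util.Arrays;

// Sketch of a character set stored as sorted, non-overlapping inclusive
// ranges. RangeCharSet is a hypothetical name, not an ANTLR class.
final class RangeCharSet {
    private final int[] lo;  // range starts, ascending
    private final int[] hi;  // matching range ends

    // 'ranges' is pairs {lo0, hi0, lo1, hi1, ...}; assumed sorted and disjoint.
    RangeCharSet(int... ranges) {
        int n = ranges.length / 2;
        lo = new int[n];
        hi = new int[n];
        for (int i = 0; i < n; i++) {
            lo[i] = ranges[2 * i];
            hi[i] = ranges[2 * i + 1];
        }
    }

    boolean member(int c) {
        int i = Arrays.binarySearch(lo, c);
        if (i >= 0) {
            return true;               // c is exactly the start of a range
        }
        int before = -i - 2;           // last range whose start is below c
        return before >= 0 && c <= hi[before];
    }

    // Bytes used by the two int arrays (ignoring object headers).
    int storageBytes() {
        return (lo.length + hi.length) * 4;
    }

    // Bytes a dense one-bit-per-char set needs for chars 0..maxChar.
    static int denseBitSetBytes(int maxChar) {
        return (maxChar + 1 + 7) / 8;
    }
}
```

With the first-character set of IDENT, new RangeCharSet('$', '$', '_',
'_', 'a', 'z', 0x80, 0xFFFE), the ranges take 32 bytes, while
denseBitSetBytes(0xFFFE) is 8192.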
>
> Shouldn't be that hard to insert.  The question is: has the use of
> Unicode made ANTLR go really slow during analysis (it shouldn't, given
> that Unicode ranges are limited to a few IDENT-like rules), or does it
> generate massive files (with 2.7.2aX, that is, as I've done lots of
> work on the bitsets)?
>
> Also, the predefines are good for people that want to say "allow German
> character set" for the charVocabulary.
>
> There is also a way to do standard character class compression like
> people use in lex and so on for NFA->DFA conversion.  I'm guessing
> though that large UNICODE *range* use is limited to charVocabulary and a
> few rules like IDENT.  Also, people writing languages that use
> punctuation from the Japanese character set, for example, might have
> UNICODE *chars* sprinkled all over the grammar...that is ok when they
> are treated as individual chars, thankfully; turn it into a set,
> however, and boom! 8k ;)
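
A rough sketch of what character class compression could look like, in
the spirit of the equivalence classes that lex-style tools build: chars
that belong to exactly the same sets in the grammar are merged into one
class, and the sets are then expressed over class numbers instead of
chars. CharClassSketch and its members are hypothetical names, not ANTLR
code, and the sketch assumes at most 64 sets, each given as sorted
inclusive ranges.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Sketch of character class compression. Hypothetical code, not ANTLR's.
final class CharClassSketch {
    private final int[] starts;   // start of each interval, ascending
    private final int[] classOf;  // class id assigned to each interval
    final int classCount;

    // sets[k] holds inclusive ranges {lo0, hi0, lo1, hi1, ...} for set k.
    CharClassSketch(int maxChar, int[][] sets) {
        // Every range start and every (range end + 1) is a cut point, so
        // set membership is constant inside each resulting interval.
        TreeSet<Integer> cuts = new TreeSet<>();
        cuts.add(0);
        for (int[] set : sets) {
            for (int i = 0; i < set.length; i += 2) {
                cuts.add(set[i]);
                if (set[i + 1] < maxChar) {
                    cuts.add(set[i + 1] + 1);
                }
            }
        }
        starts = new int[cuts.size()];
        classOf = new int[cuts.size()];
        Map<Long, Integer> ids = new HashMap<>();
        int n = 0;
        for (int start : cuts) {
            long signature = 0;   // bit k is set if set k contains the interval
            for (int k = 0; k < sets.length; k++) {
                if (contains(sets[k], start)) {
                    signature |= 1L << k;
                }
            }
            Integer id = ids.get(signature);
            if (id == null) {
                id = ids.size();  // first interval with this signature
                ids.put(signature, id);
            }
            starts[n] = start;
            classOf[n] = id;
            n++;
        }
        classCount = ids.size();
    }

    private static boolean contains(int[] set, int c) {
        for (int i = 0; i < set.length; i += 2) {
            if (c >= set[i] && c <= set[i + 1]) {
                return true;
            }
        }
        return false;
    }

    // Class id for char c (0 <= c <= maxChar).
    int classFor(int c) {
        int i = Arrays.binarySearch(starts, c);
        if (i < 0) {
            i = -i - 2;           // interval whose start is just below c
        }
        return classOf[i];
    }
}
```

Fed the two sets from the IDENT rule (first char and following chars),
the whole 16-bit vocabulary collapses to three classes: chars in both
sets, digits (second set only), and everything else.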
>
> Thanks,
> Ter
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/