[antlr-interest] ANTLR Java Code Generation

Mon Dec 24 14:14:31 PST 2001

Christian,

Seems like this ought to be combined with the optimization I just put
in, which makes an individual static method for each bitset
initialization because the main static{} section was overflowing.  Does
your solution run into problems with huge static{} sections?  For
example, 2.7.2a1 is generating stuff like this:

private static final long[] mk_tokenSet_26() {
	long[] data = { 576179277326712832L, -4611686018427281408L, 65532L,
			0L, 0L, 0L };
	return data;
}
public static final BitSet _tokenSet_26 = new BitSet(mk_tokenSet_26());
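
A rough sketch of what combining the two might look like, assuming the
run-length loops from the quoted message simply move into the per-bitset
method.  TokenSetSketch and the data values are made up for illustration,
and java.util.BitSet stands in for antlr.collections.impl.BitSet so the
snippet compiles on its own:

```java
// Sketch only: not actual ANTLR output. java.util.BitSet is a stand-in
// for antlr.collections.impl.BitSet, and the values are illustrative.
final class TokenSetSketch {
    // One small factory method per bitset, so no single method (including
    // the class initializer) has to hold the code for every bitset.
    static long[] mk_tokenSet_0() {
        long[] data = new long[20];                       // zero-filled by default
        data[0] = 1L;
        data[1] = 2L;
        data[2] = 3L;
        data[3] = 4L;
        data[4] = 5L;
        for (int i = 5; i <= 9; i++) { data[i] = -1L; }   // run of -1L
        data[12] = 13L;                                    // indices 10, 11, 13, 14 stay 0L
        for (int i = 15; i <= 19; i++) { data[i] = -1L; } // run of -1L
        return data;
    }

    static final java.util.BitSet _tokenSet_0 =
            java.util.BitSet.valueOf(mk_tokenSet_0());
}
```

Each method stays small no matter how many bitsets the grammar needs,
and the temporary array is still local to the method.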

BTW, does your optimization only reduce space for .class files, or does
it also make initialization faster?  Seems like it would be about the
same, since you are doing the same number of operations apart from the
loop overhead, right?

Ter

On Tuesday, November 13, 2001, at 04:25  AM, christian.ernst at poet.de 
wrote:

> Hi folks!
>
> While working with ANTLR I noticed a few things that could be
> changed in the Java code generation:
>
> I
> The generated code for the BitSets looks like:
>
> private static final long _tokenSet_0_data_[] = { -549755813896L, -268435457L };
> public static final BitSet _tokenSet_0 = new BitSet(_tokenSet_0_data_);
>
> For some grammars, for example java.g for the Java lexer with Unicode,
> these arrays get really big but consist mostly of runs of the same
> value.
>
> First Case :
> private static final long _tokenSet_0_data_[] = { -549755813896L,
> -268435457L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L,......};
> or
> Second Case:
> private static final long _tokenSet_0_data_[] = { -549755813896L,
> -268435457L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,......};
>
> The problem with these large arrays is the bytecode that is generated
> for such statements. It looks like:
> 1. create array long _tokenSet_0_data_[] with size n
> 2. put -549755813896L at 0
> 3. put -268435457L at 1
> 4. put -1L at 2
> .....
> n+1. put -1L at n-1
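
The steps above correspond roughly to the hand-written form below. This
is only a sketch: ArrayInitSketch and its method names are made up, and
the array is cut down to four elements. The second method leaves out the
zero stores, relying on the language guarantee that a freshly allocated
long[] is zero-filled.

```java
// Sketch only: class and method names are hypothetical.
final class ArrayInitSketch {
    // Array initializer. Per the description above, the compiler turns
    // this into an allocation followed by one store per listed element.
    static long[] withInitializer() {
        long[] data = { -549755813896L, -268435457L, 0L, 0L };
        return data;
    }

    // The same contents written out by hand. The two trailing 0L stores
    // are omitted because new long[4] already contains zeros.
    static long[] withExplicitStores() {
        long[] data = new long[4];
        data[0] = -549755813896L;
        data[1] = -268435457L;
        return data;
    }
}
```

Both methods return arrays with identical contents; only the amount of
code needed to build them differs.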
>
> Some Java compilers (Sun 1.2.2, Sun 1.8) recognize that, for example
> in the second case, where the rest of the array is 0L, no
> initialization is needed, because 0L is the default value of a long
> array element. But this was changed back in the current compilers
> (Sun JDK 1.3.1) to the long version that initializes every element.
> (I don't know exactly why, but it should have something to do with a
> bug involving inheritance.)
>
> But in our case we know that the array is only used temporarily,
> until we use it in the same class to initialize our static BitSet.
> So we should change this to match how the Java compiler rewrites our
> code.
> One useful thing we could use for this job is a static block
> (static{...}).
>
> Example one:
> private static final long _tokenSet_0_data_[] =  {
> 1,2,3,4,5,-1L,-1L,-1L,-1L,-1L,0L,0L,13,0L,0L,-1L,-1L,-1L,-1L,-1L };
> public static final BitSet _tokenSet_0 = new
> BitSet(_tokenSet_0_data_);
>
> will be generated as follows:
> public static final BitSet _tokenSet_0;
> static {
>      // initializing BitSet _tokenSet_0
>      long _tokenSet_0_data_[] = new long[20];
>      _tokenSet_0_data_[0] = 1L;
>      _tokenSet_0_data_[1] = 2L;
>      _tokenSet_0_data_[2] = 3L;
>      _tokenSet_0_data_[3] = 4L;
>      _tokenSet_0_data_[4] = 5L;
>      for(int i = 5 ; i <= 9 ; i++) { _tokenSet_0_data_[i] = -1L; }
>      _tokenSet_0_data_[12] = 13L;
>      for(int i = 15 ; i <= 19 ; i++) { _tokenSet_0_data_[i] = -1L; }
>      _tokenSet_0 = new BitSet(_tokenSet_0_data_);
> }
>
> So if the BitSets are large but contain many long runs of identical
> values, this is more efficient!
> For example, see the Java 1.3 ANTLR grammar for the lexer.
> With this solution you can cut the size of the Java lexer source file
> from 100k to 53k and the class file from 93k to 18k.
>
> ---------------------------------------
> Patch:
> ---------------------------------------
> ---------------------------------------
> Package:
> antlr.collections.impl
> Class:
> BitSet
> Todo:
> add new Method for accessing the internal long Array named:
> toLongArray()
> --------------------------------------
> Code:
> --------------------------------------
> /**
>  * Helper method for getting the internal array of words (long[] bits).
>  * It is needed for generating nicer Java code.
>  * @return long[]
>  */
> public long[] toLongArray() {
>  return bits;
> }
> ---------------------------------------
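
One design point about the accessor: it returns the internal array
itself, so a caller could modify the set through it. That is harmless
for the code generator, which only reads the words. A minimal sketch of
the trade-off follows; WordSetSketch and toLongArrayCopy() are
hypothetical names, not part of ANTLR.

```java
// Hypothetical stand-in for antlr.collections.impl.BitSet; only the part
// relevant to the accessor is modelled here.
final class WordSetSketch {
    private final long[] bits;

    WordSetSketch(long[] words) {
        this.bits = words.clone();
    }

    /** As in the patch: returns the internal array, so writes are visible in the set. */
    long[] toLongArray() {
        return bits;
    }

    /** Assumed alternative: returns a copy, so the caller cannot change the set. */
    long[] toLongArrayCopy() {
        return bits.clone();
    }
}
```

The copying variant costs one allocation per call, which should not
matter at code-generation time.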
> ---------------------------------------
> Package:
> antlr
> Class:
> JavaCodeGenerator
> Todo:
> modify the Java Code Generation method  named: genBitsets(Vector
> bitsetList, int maxVocabulary)
> ---------------------------------------
> Code:
> ---------------------------------------
> /** Generate all the bitsets to be used in the parser or lexer
>  * Generate the raw bitset data like "long _tokenSet1_data[] = {...};"
>  * and the BitSet object declarations like
>  * "BitSet _tokenSet1 = new BitSet(_tokenSet1_data);"
>  * Note that most languages do not support object initialization inside a
>  * class definition, so other code-generators may have to separate the
>  * bitset declarations from the initializations (e.g., put the
>  * initializations in the generated constructor instead).
>  * @param bitsetList The list of bitsets to generate.
>  * @param maxVocabulary Ensure that each generated bitset can contain at
>  *        least this value.
>  */
> protected void genBitsets(Vector bitsetList, int maxVocabulary)
> {
>
>     for (int i = 0; i < bitsetList.size(); i++)
>     {
>         BitSet p = (BitSet) bitsetList.elementAt(i);
>         // Ensure that generated BitSet is large enough for vocabulary
>         p.growToInclude(maxVocabulary);
>     }
>
>     // generate the Java code
>
>     // In some conditions these bitsets contain long sequences of
>     // identical words. If we initialize them with
>     //     long bits[] = { 434324,3234,623,-1L,-1L,-1L,-1L,0L,0L,0L,0L,... };
>     //     BitSet set = new BitSet(bits);
>     // the class file gets really huge. The reason is that for every
>     // element inside {} the Java compiler generates bytecode
>     // equivalent to bits[i] = element, even when the element is 0
>     // and even for repeated values. By generating these initializers
>     // on our own we can optimize that:
>     //   - by not initializing elements that are 0
>     //   - by using loops for long sequences
>     // In addition, this makes the Java source smaller on average.
>
>     // For example, see the Java 1.3 ANTLR grammar for the lexer:
>     // with this solution you can cut the size of the
>     // Java lexer source from 100k to 53k
>     // and the class file from 93k to 18k.
>
>     println("");
>     // declare the static variables for our bitsets
>     for (int i = 0; i < bitsetList.size(); i++)
>     {
>         println("public static final BitSet " + getBitsetName(i) + ";");
>     }
>
>     // generate the static block for initializing
>     println("");
>     println("// BitSet initializing ");
>     println("static {");
>     for (int i = 0; i < bitsetList.size(); i++)
>     {
>         long bits[] = ((BitSet) bitsetList.elementAt(i)).toLongArray();
>         int bitLength = bits.length;
>         // name of the temporary array in the generated code
>         String dataName = getBitsetName(i) + "_data_";
>
>         println("    // initializing BitSet " + getBitsetName(i));
>         println("    long " + dataName + "[] = new long[" + bitLength + "];");
>
>         int seqStartIndex = 0;
>         boolean seq = false;
>         for (int index = 0; index < bitLength; index++)
>         {
>             // is the next word identical? => we are inside a sequence
>             if ((index + 1 < bitLength) && (bits[index] == bits[index + 1]))
>             {
>                 seq = true;
>             }
>             else
>             {
>                 // next word differs (or end of array): a run ends here
>                 if (seq)
>                 {
>                     // emit a loop only if the repeated value isn't 0L
>                     if (bits[index] != 0L)
>                     {
>                         print("    for(int i = " + seqStartIndex + " ; i <= " + index + " ; i++) {");
>                         print(dataName + "[i] = " + bits[index] + "L;");
>                         println("}");
>                     }
>                     // sequence over
>                     seq = false;
>                 }
>                 else if (bits[index] != 0L)
>                 {
>                     // single word: emit a plain store (0L is already the default)
>                     println("    " + dataName + "[" + index + "] = " + bits[index] + "L;");
>                 }
>                 seqStartIndex = index + 1;
>             }
>         }
>         println("    " + getBitsetName(i) + " = new BitSet(" + dataName + "); ");
>         println("");
>     }
>     // end of the static block
>     println("}");
> }
> ------------------------------------------
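
The run detection in the patch can also be tried outside ANTLR. The
sketch below is a standalone variant under stated assumptions:
BitsetInitSketch and emit() are made-up names, it returns the generated
statements as a string instead of calling println(), and it scans run
by run rather than with the seq flag. Zero words are skipped whether
they are single or repeated.

```java
// Standalone sketch of the run-length emitter; not the patch itself.
final class BitsetInitSketch {
    /** Returns Java statements that rebuild bits in an array called name. */
    static String emit(String name, long[] bits) {
        StringBuilder out = new StringBuilder();
        out.append("long[] ").append(name)
           .append(" = new long[").append(bits.length).append("];\n");
        int start = 0;
        while (start < bits.length) {
            // extend the run while the following word equals the first one
            int end = start;
            while (end + 1 < bits.length && bits[end + 1] == bits[start]) {
                end++;
            }
            long value = bits[start];
            if (value != 0L) {               // 0L is the default, emit nothing
                if (end > start) {           // two or more equal words: a loop
                    out.append("for (int i = ").append(start)
                       .append("; i <= ").append(end).append("; i++) { ")
                       .append(name).append("[i] = ").append(value).append("L; }\n");
                } else {                     // single word: a plain store
                    out.append(name).append("[").append(start).append("] = ")
                       .append(value).append("L;\n");
                }
            }
            start = end + 1;
        }
        return out.toString();
    }
}
```

For example, emit("d", new long[] { 7L, -1L, -1L, -1L, 0L, 0L, 5L, 0L })
yields the declaration, one store for index 0, one loop over indices 1
to 3, and one store for index 6.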
>
> mfg
> christian
>
>
>
>
>
> Your use of Yahoo! Groups is subject to 
> http://docs.yahoo.com/info/terms/
>
>
--
Chief Scientist & Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
