[antlr-interest] Java code generator memory optimization

Sun Sep 25 11:31:37 PDT 2005

On Sep 25, 2005, at 8:25 AM, Akhilesh Mritunjai wrote:

> Hi Terence
>
> My comments inline:
>
> --- Terence Parr <parrt at cs.usfca.edu> wrote:
>
>> In the ANTLR v3 version, I have tokens point at the
>> start/stop index
>> into a single char buffer that has the entire input
>> text (well, that
>> is the default anyway).  So, you have duplicates
>> still in the sense
>> that all references to identifier "salary" are not
>> shared, but at
>> least there are not multiple copies as there are now
>> by default. :)
>>
>
> afaik, that's how the current one works too.

Current CommonToken makes a new String object from a buffer.  It has to
make a copy (which needs a new char array) because that buffer is
overwritten by the next token.  It does not point into the buffer with
indexes as v3 does.
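
To make the difference concrete, here is a rough sketch of a token
that only records start/stop indexes into one shared char buffer.  The
class and field names are made up for illustration and are not the
actual v3 classes:

```java
// Hypothetical names throughout; this is not ANTLR's actual API.
final class IndexedToken {
    private final char[] input; // shared buffer holding the entire input text
    private final int start;    // index of the token's first char
    private final int stop;     // index of the token's last char (inclusive)

    IndexedToken(char[] input, int start, int stop) {
        this.input = input;
        this.start = start;
        this.stop = stop;
    }

    // The String is built only on request; until then the token holds
    // one reference and two ints, with no per-token char array.
    String getText() {
        return new String(input, start, stop - start + 1);
    }
}

class IndexedTokenDemo {
    public static void main(String[] args) {
        char[] input = "salary = salary + 1".toCharArray();
        // Both tokens refer to the same buffer; neither copies it up front.
        IndexedToken first = new IndexedToken(input, 0, 5);
        IndexedToken second = new IndexedToken(input, 9, 14);
        System.out.println(first.getText());
        System.out.println(second.getText());
    }
}
```

Note that getText() still allocates a String each time it is called,
so the saving applies only to tokens whose text is never (or rarely)
requested.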

> Lexer
> makes strings from chars it gets from input stream. So
> for every identifier in stream you get entirely
> different string objects with separate char arrays.

Yes.  New one will point into one buffer; no separate char arrays.

> Of
> course, they won't be duplicated more than they occur
> in input stream... and there is no sharing at all and
> won't be with that approach in v3.0

hooray!

>> If your file is 1M, it's probably pretty big and
>> that's just not
>> enough memory to worry about these days.  Wow, I
>>
>
> Um... The certification for mine will happen on an
> input file set around 37 MB in size, and then some
> people out there must be doing continuous stream
> parsing.

Yes, it can handle continuous streams (though not by default), and 37MB
is still teeny, really.  I have 2G on my box. ;)

> The current suggestion comes from my observation of
> processing an 8MB automatically generated sadist
> pathological example made by me for which the parse
> tree contains total of 5.7M nodes... 40% are
> identifier subtree nodes and every one has a string
> object. I intern'ed the node texts and, bam!!, it
> saved me 150MB of memory  :)
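
For anyone following along, a minimal sketch of the interning trick
described above.  It uses only the standard String.intern(); the
copyText helper is hypothetical and just stands in for a lexer copying
chars out of its buffer:

```java
class InternDemo {
    // Stand-in for a lexer building token text: every call allocates
    // a fresh String with its own copy of the characters.
    static String copyText(char[] input, int start, int length) {
        return new String(input, start, length);
    }

    public static void main(String[] args) {
        char[] input = "salary salary".toCharArray();
        String a = copyText(input, 0, 6);
        String b = copyText(input, 7, 6);

        // Equal contents, but two separate objects.
        System.out.println(a == b);                   // prints false

        // intern() returns one canonical instance per distinct text, so
        // nodes that store the interned reference share a single String.
        System.out.println(a.intern() == b.intern()); // prints true
    }
}
```

The saving depends on how often the same text repeats; input with
mostly unique identifiers would gain little from this.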

Nice!  Some folks have suggested making the Token object also the  
tree node, which will further factor things out.  ANTLR v3 can do  
this no sweat as I assume absolutely nothing about the type of a tree  
node.  You pass me an "adaptor" that tells me how to add children and  
navigate :)
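
Roughly, the idea looks like the sketch below.  The interface and
class names are invented for illustration and the real v3 adaptor may
look quite different; the point is only that the tree-building code
talks to the adaptor and never to the node type itself:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface, not the actual v3 adaptor.
interface NodeAdaptor<T> {
    void addChild(T parent, T child);
    T getChild(T parent, int i);
    int getChildCount(T parent);
}

// A token that doubles as its own tree node.
class TokenNode {
    final String text;
    final List<TokenNode> children = new ArrayList<>();

    TokenNode(String text) {
        this.text = text;
    }
}

class TokenNodeAdaptor implements NodeAdaptor<TokenNode> {
    public void addChild(TokenNode parent, TokenNode child) {
        parent.children.add(child);
    }

    public TokenNode getChild(TokenNode parent, int i) {
        return parent.children.get(i);
    }

    public int getChildCount(TokenNode parent) {
        return parent.children.size();
    }
}

class AdaptorDemo {
    // Generic tree construction: works for any node type T, given an adaptor.
    static <T> T buildBinary(NodeAdaptor<T> adaptor, T root, T left, T right) {
        adaptor.addChild(root, left);
        adaptor.addChild(root, right);
        return root;
    }

    public static void main(String[] args) {
        NodeAdaptor<TokenNode> adaptor = new TokenNodeAdaptor();
        TokenNode plus = buildBinary(adaptor, new TokenNode("+"),
                                     new TokenNode("salary"), new TokenNode("1"));
        System.out.println(adaptor.getChildCount(plus));
        System.out.println(adaptor.getChild(plus, 0).text);
    }
}
```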

> Uh, I dunno how to put it, but somehow Terence, you
> seem to underestimate the reach, potential and
> influence of all the kickass tools you've made. I did

Wow!  Hooray!  Thanks...the next version is gonna kick so much ass it  
will reach around and kick my own ass ;)  Ha hah hahah!  I'm hoping  
to have it "ready for student abuse" in January.

> a lot of research and will have a solid testimony once
> I complete this thing... one being making the difference
> between the product ending in success or a sad failure.

That's awesome...can't wait to hear the results (good or bad).

>> remember when my 16k
>> machine was great! ;)  Anybody remember which
>> processor was 1.077 mhz
>>
>
> God I'm young... my first was a 640k, 16 something MHz
> on which I learnt BASIC and MSDOS 3.3 & 5.0 more than
> a decade back :)

That's pretty good.  Some folks "became conscious" on 128M machines. ;)

BTW, the 1.077 MHz processor was the 6502 in the Apple II. :)

Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com