[antlr-interest] Freeing memory as you go using C target

C. Mundi cmundi at gmail.com
Sat Jun 13 04:24:20 PDT 2009


Questions about having to read the whole input seem to come up about
once a month on the list.  I've never paid much attention because I'm
passing tokens in provably atomic chunks to a stateful parser, which
logically reduces to slurping the whole stream.

But I am curious: how do the folks who write real compilers answer
this question?  Seems like some grammars could make it very hard to
recognize atomic chunks at the lexing stage.  That probably says more
about my feeble lexical skill than anything else.

So how would a C# compiler be built differently by Microsoft than by
someone using ANTLR?

Pointers to literature gladly accepted.

Thanks
Carlos


On 6/4/09, Jim Idle <jimi at temporal-wave.com> wrote:
> Peterson, Joe wrote:
>>
>> Hello all,
>>
>> We're using the C target to generate a parser for a very large file with
>> very large independent sections.  Unfortunately, it's consuming 1GB+ of
>> memory when parsing the file.  Is there any way to free up the memory
>> consumed as we go without pre-parsing the file to chunk it up?  Everything
>> I've found in my searches seem to point to splitting up the file before
>> processing it. After a particular token, can we free the memory used by
>> that token?  For example if we had:
>>
>> contents : header body footer;
>> header : BLAH BLAH BLAH;
>> footer : BLA BLA BLA;
>> body : section+;
>>
>> section
>> @after {
>> 	// can we free everything here even though we're still in body?
>> }
>>     : LBRACE morestuff RBRACE;
>>
>
> Not really, as you must tokenize all the input before you call the parser;
> so one way or another you will create that many tokens and consume the
> memory for them. Assuming you are not creating an AST, and are not using
> the ANTLR3_STRING convenience methods (via $x.text etc.) to create lots of
> strings as you go, this must be one monster of an input file to use that
> much memory just for tokens, as the tokens do not copy the input text
> unless you ask for their string values.
>
> If you are using $text references and so on, then you probably don't want
> to do that, as the string factory accumulates all of its memory until the
> whole parser is finally shut down. Instead, write your own methods that
> accept a token and do what you want with it by reference to the copy of
> the input you have. For instance, you can use the input directly and
> null-terminate in place after strings, IDs, and so on.
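>
> As one safe variant of that idea, here is a minimal sketch. The helper
> name tokenTextInPlace is ours, not part of the runtime, and it assumes
> the default in-memory 8-bit input stream, where a token's start and stop
> indexes are pointers into your own copy of the input:
>
>     #include <antlr3.h>
>     #include <string.h>
>
>     /* Copy a token's text into caller-owned storage, bypassing the
>      * string factory entirely so nothing accumulates in the parser.
>      * (Illustrative helper, not a runtime API.)
>      */
>     static char *
>     tokenTextInPlace(pANTLR3_COMMON_TOKEN tok, char *buf, size_t bufLen)
>     {
>         const char *start = (const char *) tok->getStartIndex(tok);
>         const char *stop  = (const char *) tok->getStopIndex (tok);
>         size_t      len   = (size_t) (stop - start) + 1;
>
>         if (len >= bufLen)          /* truncate rather than overflow */
>         {
>             len = bufLen - 1;
>         }
>         memcpy(buf, start, len);
>         buf[len] = '\0';            /* null-terminate for the caller */
>         return buf;
>     }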
>
> I think that you probably need to split it up into the independent sections.
>
> However, if you can detect the end of a section lexically, then I posted
> some time back on a strategy for creating your own input stream, which you
> keep reusing until it says there are no more sections. You could also
> reduce the memory by decreasing the size of the token vector to remove
> things that you definitely have no more use for, but you would have to
> delve into the internals and reset the counters in the token stream and so
> on. In the end, if these sections are really independent, it would be
> easier to just split them up unless you know the C runtime internals well
> :-).
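>
> As a sketch of the "just split them up" route (not the reusable input
> stream trick), you can drive a fresh lexer/parser over each chunk and
> free everything between chunks. The grammar name Sect and the scanner
> findSectionEnd() are placeholders for your real grammar and delimiter
> logic:
>
>     pANTLR3_UINT8 cursor = fileData;          /* your copy of the file */
>     pANTLR3_UINT8 end    = fileData + fileLen;
>
>     while (cursor < end)
>     {
>         /* Find the next section boundary (placeholder scanner). */
>         pANTLR3_UINT8 sectEnd = findSectionEnd(cursor, end);
>
>         pANTLR3_INPUT_STREAM input =
>             antlr3NewAsciiStringInPlaceStream(cursor,
>                                               (ANTLR3_UINT32)(sectEnd - cursor),
>                                               (pANTLR3_UINT8) "section");
>         pSectLexer                  lxr     = SectLexerNew(input);
>         pANTLR3_COMMON_TOKEN_STREAM tstream =
>             antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
>                                              TOKENSOURCE(lxr));
>         pSectParser                 psr     = SectParserNew(tstream);
>
>         psr->section(psr);                    /* parse one section */
>
>         /* Tear everything down so the token memory comes back before
>          * the next chunk is parsed.
>          */
>         psr->free(psr);
>         tstream->free(tstream);
>         lxr->free(lxr);
>         input->close(input);
>
>         cursor = sectEnd;
>     }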
>
> One final thing: does 1GB actually matter? Are you likely to run this on a
> machine that cannot handle that? I haven't had a machine with less than
> 8GB of memory for years :-). However, the target application determines
> that, of course.
>
> There will be work done on a non-buffering token stream at some point,
> which is probably what you really need, but I cannot guarantee just when
> unless someone who is paying me wants it done ;-) If you send me your
> grammar I might be able to give you some pointers (send it offline if it
> is commercially sensitive or something).
>
> Jim
> http://www.linkedin.com/in/jimidle
>
>
>

-- 
Sent from my mobile device

