[antlr-interest] Bounding the token stream in the C backend

Christopher L Conway cconway at cs.nyu.edu
Thu Feb 25 09:29:48 PST 2010


Jim,

You didn't read my email. The input file is 39MB and legitimately has
more than 12M tokens. I've stepped through the code and the tokenizer
terminates. The problem is that it grabs more than 3GB of memory in
the process (on the order of a couple of hundred bytes per token, once
the token structs and buffer overhead are counted), and the parser as
a whole grinds to a halt under the memory pressure. Presumably I need
to replace the all-at-once token buffering with a bounded, on-demand
scheme, but I'm not sure where to start. If I did some work on this,
is it something you'd be interested in incorporating into the trunk?

Regards,
Chris

On Thu, Feb 25, 2010 at 10:40 AM, Jim Idle <jimi at temporal-wave.com> wrote:
> The problem is almost certainly your lexer. Look for a rule with an empty alternative; such a rule matches forever and consumes no input:
>
> FRED : ;
>
> Jim
>
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of Nick Vlassopoulos
>> Sent: Thursday, February 25, 2010 7:31 AM
>> To: Christopher L Conway
>> Cc: antlr-interest at antlr.org
>> Subject: Re: [antlr-interest] Bounding the token stream in the C
>> backend
>>
>> Hi Christopher,
>>
>> I am not entirely sure, but you may have run into the same problem
>> as I did a while ago. You may want to have a look at the discussion
>> thread from back then for some advice:
>> http://www.antlr.org/pipermail/antlr-interest/2009-April/034125.html
>> In the end I used the simple solution Jim suggested, i.e. parsed the
>> headers with ANTLR and used custom code to parse the rest of the
>> file, but some of the advice in that thread might be helpful.
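>>
>> In case it helps, here is a rough sketch of that split. The grammar
>> name (Header), the "%%" end-of-header marker, and the start rule are
>> all made up for illustration, and I'm quoting the 3.x C-runtime
>> constructors from memory:
>>
>> #include <stdio.h>
>> #include <string.h>
>> #include <antlr3.h>
>> #include "HeaderLexer.h"   /* made-up generated lexer/parser pair */
>> #include "HeaderParser.h"
>>
>> static void parse_file(const char *path)
>> {
>>     FILE *fp = fopen(path, "rb");
>>     if (fp == NULL)
>>         return;
>>
>>     /* 1. Read only the header into memory, up to the made-up "%%"
>>      *    marker line. */
>>     static char header[64 * 1024];
>>     size_t      len = 0;
>>     char        line[4096];
>>     while (fgets(line, sizeof line, fp) != NULL
>>            && strcmp(line, "%%\n") != 0)
>>     {
>>         size_t n = strlen(line);
>>         if (len + n >= sizeof header)
>>             break;                  /* header larger than expected */
>>         memcpy(header + len, line, n);
>>         len += n;
>>     }
>>
>>     /* 2. Let the ANTLR-generated parser handle just the header,
>>      *    reading from the in-memory string. */
>>     pANTLR3_INPUT_STREAM input = antlr3NewAsciiStringInPlaceStream(
>>         (pANTLR3_UINT8)header, (ANTLR3_UINT32)len,
>>         (pANTLR3_UINT8)"hdr");
>>     pHeaderLexer lxr = HeaderLexerNew(input);
>>     pANTLR3_COMMON_TOKEN_STREAM tstream =
>>         antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
>>                                          TOKENSOURCE(lxr));
>>     pHeaderParser psr = HeaderParserNew(tstream);
>>     psr->header(psr);               /* made-up start rule */
>>
>>     psr->free(psr);
>>     tstream->free(tstream);
>>     lxr->free(lxr);
>>     input->close(input);
>>
>>     /* 3. Scan the large, regular body with plain hand-written code,
>>      *    so no token buffer ever holds the whole file. */
>>     while (fgets(line, sizeof line, fp) != NULL)
>>     {
>>         /* process one body line here */
>>     }
>>     fclose(fp);
>> }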
>>
>> Hope this helps,
>>
>> Nikos
>>
>>
>> On Thu, Feb 25, 2010 at 6:09 AM, Christopher L Conway
>> <cconway at cs.nyu.edu> wrote:
>>
>> > I've got a large input file (~39MB) that I'm attempting to parse
>> > with an ANTLR3-generated C parser. The parser is using a huge
>> > amount of memory (~3.7GB) and seems to start thrashing without
>> > making much progress towards termination. I found a thread from
>> > earlier this month (http://markmail.org/message/jfngdd2ci6h7qrbo)
>> > suggesting the most likely cause of such behavior is a parser bug,
>> > but I've stepped through the code and it seems to be lexing just
>> > fine. Rather, it seems the problem is that fillBuffer() is
>> > tokenizing the whole file in one go; then the parsing rules slow
>> > to a crawl because the token buffer is sitting on all the memory.
>> >
>> > I wonder if there is a way to change fillBuffer()'s behavior, so
>> > that it will only lex some bounded number of tokens before
>> > allowing parsing to proceed?
>> >
>> > Thanks,
>> > Chris
>> >


More information about the antlr-interest mailing list