[antlr-interest] Why does Lexer of C++ run time target eat so much memory

chain one chainone at gmail.com
Tue Dec 16 19:49:40 PST 2008


Thanks Jim. After your patient explanation, I think I am now clear about where
the problem lies.

I think that in the current C runtime target, a huge file leads to a huge number
of tokens that need to be pre-stored in memory. That makes the ANTLR C runtime
unsuitable for very large inputs because of the memory consumption.
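
As a rough back-of-envelope check (the per-token cost here is an assumption of
mine, not a measured figure): a 20M file of mostly two-character tokens gives
roughly 10 million tokens, and at even a few dozen bytes of token structure and
bookkeeping each, that is already several hundred megabytes before the parser
consumes a single token - about the order of magnitude I am seeing.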

I think the best choice is definitely to hand-craft a lexer (with its own token
buffer) and override the token stream. I believe that would be especially useful
in scenarios where the input cannot be read all at once, such as socket-based
input.
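
Something along these lines is what I have in mind - a minimal sketch under my
own assumptions, not ANTLR3 runtime code: every name in it (Token, TokenRing,
next_token_from_input, ring_lt) is made up for illustration. The point is only
that a bounded ring buffer, filled by the lexer on demand as the parser asks for
lookahead, keeps memory constant no matter how large the file is:

    /* Hypothetical bounded, on-demand token buffer (illustration only,
     * not the ANTLR3 C runtime API). */
    #include <stdio.h>

    #define RING_SIZE 1024              /* bounded lookahead window          */

    typedef struct { int type; long start; long stop; } Token;

    typedef struct {
        Token  buf[RING_SIZE];
        long   low;                     /* index of oldest buffered token    */
        long   high;                    /* index one past the newest token   */
        FILE  *input;                   /* zero-initialise, then set input   */
    } TokenRing;

    /* Stand-in for a hand-written lexer: here every byte becomes a
     * one-character token, just so the sketch is self-contained. */
    static int next_token_from_input(TokenRing *r, Token *out)
    {
        int c = fgetc(r->input);
        if (c == EOF) return 0;
        out->type  = c;
        out->start = out->stop = ftell(r->input) - 1;
        return 1;
    }

    /* LT(i)-style lookahead: lex just enough so that token number
     * (consumed + i - 1) is in the buffer. Older tokens are overwritten,
     * so memory never grows beyond RING_SIZE tokens. */
    static Token *ring_lt(TokenRing *r, long consumed, int i)
    {
        long want = consumed + i - 1;
        while (r->high <= want) {
            Token t;
            if (!next_token_from_input(r, &t)) return NULL;      /* EOF    */
            r->buf[r->high % RING_SIZE] = t;
            r->high++;
            if (r->high - r->low > RING_SIZE) r->low++;          /* evict  */
        }
        return (want < r->low) ? NULL : &r->buf[want % RING_SIZE];
    }

A parser (or the overridden token stream) would then call ring_lt() for its
lookahead instead of indexing into a pre-built array of every token in the file.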

2008/12/17 Jim Idle <jimi at temporal-wave.com>

> On Tue, 16 Dec 2008 18:28:50 -0800, chain one <chainone at gmail.com> wrote:
>
> Thanks Jim,
>
>
> Your suggestion is very helpful to me.
> I have checked my grammar according to your advice and now SKIP some useless
> tokens.
> After doing this, the peak memory usage decreased from 660M to 480M.
> However, 480M is still a very large number to me.
>
> And I found some strange behavior in the lexer of the ANTLR3 C runtime.
>
> The first two rules of my grammar to be called when parsing are:
> syntax
>   : ISO_HEADER header_sec data_sec ISO_END
>   ;
> header_sec
>   : HEADER file_des file_name file_schema END_SEC
>   ;
>
> I placed breakpoints in both rule functions.
> And the first parameter of the function "antlr3CommonTokenStreamSourceNew",
> which I think is used to adjust the token buffer size, is left at the
> default value ANTLR3_SIZE_HINT (1025):
>
>     tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
>
> I believed the number of buffered tokens would never exceed 1025 while
> parsing, but what happened later makes me doubt this.
>
>
>
> When debugging, I found the breakpoint in the syntax rule (the top rule) was
> triggered right after the parser started to work.
> That's OK, because syntax() is called in the main function.
>
> But what was strange was that the breakpoint in header_sec was not triggered
> until memory had grown to 478M!
> It looks as if the lexer keeps all of the tokens recognized from the large
> input file in a buffer before the parser really starts to work.
> Am I right? If I am, then what does ANTLR3_SIZE_HINT mean? How do I adjust
> the buffer size in the C runtime?
>
>
> 2008/12/17 Jim Idle <jimi at temporal-wave.com>
>
>> On Tue, 16 Dec 2008 15:54:58 -0800, chain one <chainone at gmail.com> wrote:
>>
>> > Still waiting for help.
>> > I just want to know: is the C runtime target suitable for large input?
>>
>> Yes.
>>
>>
>> >
>> > On 12/16/08, chain one <chainone at gmail.com> wrote:
>> >> Hi,
>> >> These days I am writing a parser for a kind of data file using C++. The
>> >> format of the data file is simple, so the rules are simple.
>> >> But when I feed an approximately 20M data file to the parser, the parser
>> >> eats almost 600M+ of memory.
>> >> I am surprised by this result, and I found that most of the memory and
>> >> time were consumed by the lexer.
>>
>> There is possibly something not quite right with this then.
>>
>> However, a 20M input file is going to generate a lot of tokens, and you need
>> all of the tokens in order to parse the input, hence you are using a lot of
>> memory - especially if a lot of your tokens are just a few characters long.
>> If all your tokens were one character, then you would need 20M tokens - that
>> would be the worst case, and your case will be something less than this.
>>
>> One way to reduce the number of tokens is to use the SKIP(); macro on
>> tokens that you don't need the parser to see, such as ',' or ' ' and so on.
>> Otherwise they are sitting in your token stream for no reason. Only mark
>> them as hidden and keep them in the token stream if you will need to examine
>> them later. Otherwise SKIP them.
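>>
>> For example (an illustrative rule of my own - the name WS and the character
>> set are not from your grammar), whitespace can be dropped before it ever
>> becomes a token:
>>
>>     WS : (' ' | '\t' | '\r' | '\n')+ { SKIP(); } ;
>>
>> Nothing matched by such a rule is ever handed to the token stream at all.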
>>
>
> I think you are not understanding the interaction. The lexer will run first
> and will produce a collection of ALL the tokens in the file. Once it has
> done this, THEN the parser will run. Hence you have a HUGE number of tokens
> - many, many more than 1024. You don't really need to change the size hint;
> it is just how many tokens the lexer token factory pre-allocates in one
> chunk. Increasing it will not change much, as the same tokens are getting
> allocated anyway (it is just faster to allocate them in chunks like this).
>
> Looking at what you are parsing tends to make me think you would be better
> off with a filtering lexer, avoiding the parser altogether. Then you don't
> need to deal with any tokens, as you can SKIP them all. Or perhaps this
> language does not lend itself to this kind of lexer/parser interaction if
> you cannot deal with the huge memory use.
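>
> As a sketch of that (the grammar and rule names are just illustrative, and
> this assumes the filter option is available for your target):
>
>     lexer grammar DataScan;
>     options { language=C; filter=true; }
>
>     HEADER_LINE
>         : 'HEADER' (~'\n')* { /* process the header text here */ SKIP(); }
>         ;
>
> With filter=true the lexer tries the rules in order and silently discards
> anything that does not match, so you do the work in lexer actions and never
> build up a token stream for a parser at all.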
>
> Here is another way, though: hand-craft a lexer that does not pre-allocate
> tokens like that, and override the token stream, etc.
>
> Jim
>
>


More information about the antlr-interest mailing list