[antlr-interest] Why does Lexer of C++ run time target eat so much memory

Jim Idle jimi at temporal-wave.com
Tue Dec 16 19:08:49 PST 2008


On Tue, 16 Dec 2008 18:28:50 -0800, chain one <chainone at gmail.com> wrote:


Thanks Jim,


Your suggestion was very helpful.
I checked my grammar according to your advice and now SKIP some useless tokens.
After doing this, the peak memory usage decreased from 660M to 480M.
However, 480M still seems very large to me.

I also found some strange behavior in the lexer of the ANTLR3 C runtime.

The first two rules of my grammar to be called when parsing are:
syntax
	: ISO_HEADER header_sec data_sec ISO_END
	;
header_sec
	: HEADER file_des file_name file_schema END_SEC
	;

I placed breakpoints in both functions.
The first parameter of the function antlr3CommonTokenStreamSourceNew, which I think is used to adjust the token buffer size, is left at the default value ANTLR3_SIZE_HINT (1025):
            tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lex));

I believed the number of tokens would never exceed 1025 during parsing, but what happened later made me doubt this.



When debugging, I found that the breakpoint in the syntax rule (the top rule) was triggered right after the parser started to work.
That's expected, because syntax() is called from the main function.

But what was strange is that the breakpoint in header_sec was not triggered until memory had grown to 478M!
It seems the lexer keeps all the tokens recognized from the large input file in a buffer before the parser really starts to work.
Am I right? If so, what does ANTLR3_SIZE_HINT mean, and how do I adjust the buffer size in the C runtime?


2008/12/17 Jim Idle <jimi at temporal-wave.com>
On Tue, 16 Dec 2008 15:54:58 -0800, chain one <chainone at gmail.com> wrote:

> Still waiting for help
> I just want to know: is the C runtime target suitable for large input?

Yes.



>
> On 12/16/08, chain one <chainone at gmail.com> wrote:
>> Hi,
>> These days I am writing a parser for a kind of data file using C++. The
>> format of the data file is simple, so the rules are simple.
>> But when I feed an approximately 20M data file to the parser, the parser
>> eats almost 600M+ of memory.
>> I am surprised by this result, and I found that most of the memory and time
>> were consumed by the lexer.

There is possibly something not quite right with this then. 

However, a 20M input file is going to generate a lot of tokens, and you need all of the tokens in order to
parse the input; hence you are using a lot of memory - especially if a lot of your tokens are just a few characters long. If all your tokens were one character, then you would need 20M tokens - that is the worst case, and your case will be something less than this.
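As a rough back-of-envelope (the per-token figures here are only guesses for illustration, not measured sizes of the C runtime's token structure):

    20M of input / ~3 characters per token   ->  roughly 6-7 million tokens
    ~7,000,000 tokens x ~64-96 bytes each    ->  roughly 450-650M

which is the same ballpark as the numbers you are seeing.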

One way to reduce the number of tokens is to use the SKIP(); macro on tokens that you don't need the parser to see, such as ',' or ' ' and so on. Otherwise they are sitting in your token stream for no reason. Only mark them as hidden and keep them in the token stream if you will need to examine them later. Otherwise SKIP them.
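For example, in the lexer (a minimal sketch - use whatever separators your format actually has, and only skip things the parser genuinely never needs to see):

    WS        : (' ' | '\t' | '\r' | '\n')+  { SKIP(); } ;   // whitespace never becomes a buffered token
    SEPARATOR : ','                          { SKIP(); } ;   // drop separators the parser does not inspect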

I think you are not understanding the interaction. The lexer will run first and will produce a collection of ALL the tokens in the file. Once it has done this, THEN the parser will run. Hence you have a HUGE number of tokens - many, many more than 1024. You don't really need to change the size hint; it is just how many tokens the lexer's token factory pre-allocates in one chunk. Increasing it will not change much, as the same tokens get allocated either way (it is just faster to allocate them in chunks like this).
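In other words, you could pass a larger hint, but it only changes the chunk size the factory allocates in, not the total number of tokens that end up allocated:

    /* same call as in your driver; a bigger hint just means fewer, larger allocations */
    tokens = antlr3CommonTokenStreamSourceNew(10 * ANTLR3_SIZE_HINT, TOKENSOURCE(lex));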

Looking at what you are parsing tends to make me think you would be better off with a filtering lexer and avoiding the parser altogether. Then you don't need to deal with any tokens, as you can SKIP them all. Or perhaps this language does not lend itself to this kind of lexer/parser interaction if you cannot deal with the huge memory.
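Something along these lines, assuming the filter option is available for your target (only a sketch - the grammar name and the rule pattern are made up; fit them to your actual records):

    lexer grammar DataFilter;   // hypothetical name
    options {
        language = C;
        filter   = true;        // rules are tried in order; input that matches none of them is skipped
    }

    RECORD
        : '#' ('0'..'9')+ '=' (~';')* ';'
          { /* do whatever processing you need here, then */ SKIP(); }   // nothing is ever buffered
        ;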

Another option, though, is to hand-craft a lexer that does not pre-allocate tokens like that, and to override the token stream, etc.

Jim

