[antlr-interest] Why does the lexer of the C++ runtime target eat so much memory

chain one chainone at gmail.com
Tue Dec 16 18:28:50 PST 2008


Thanks Jim, your suggestion was very helpful to me.
I have checked my grammar according to your advice and applied SKIP() to
some useless tokens.
After doing this, the peak memory usage decreased from 660 MB to 480 MB.
However, 480 MB still seems like a very large number to me.
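
For reference, this is the kind of change I made (a sketch; WS and COMMA
stand in for my real token names):

WS    : (' ' | '\t' | '\r' | '\n')+ { SKIP(); } ; // dropped via the C target's SKIP() macro
COMMA : ',' { SKIP(); } ;                         // separator the parser never needs to see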

I have also found some strange behavior in the lexer of the ANTLR3 C
runtime.

The first two rules of my grammar to be invoked when parsing are:
syntax
: ISO_HEADER  header_sec data_sec ISO_END
;
header_sec
: HEADER  file_des file_name file_schema END_SEC
;

I placed breakpoints in both functions.
The first parameter of the function "antlr3CommonTokenStreamSourceNew",
which I think is used to adjust the token buffer size, is left at the
default value ANTLR3_SIZE_HINT (1025):

tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
                                          TOKENSOURCE(lex));

I believed the number of buffered tokens would never exceed 1025 while
parsing, but what happened later made me doubt this.

When debugging, I found the breakpoint in the syntax rule (the top rule)
was triggered right after the parser started to work.
That is expected, because syntax() is called from the main function.

What is strange is that the breakpoint in header_sec was not triggered
until memory usage had grown to 478 MB!
It looks as if the lexer keeps every token recognized from the large
input file in a buffer before the parser really starts to work.
Am I right? If I am, then what does ANTLR3_SIZE_HINT mean, and how do I
adjust the buffer size in the C runtime?
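
For reference, my driver follows the usual ANTLR3 C pattern (a sketch;
"data.txt" and the MyLexer/MyParser names are placeholders for my actual
input file and the generated types and factory functions):

// Sketch of a standard ANTLR3 (3.1-era) C driver sequence.
pANTLR3_INPUT_STREAM        input;
pMyLexer                    lex;
pANTLR3_COMMON_TOKEN_STREAM tokens;
pMyParser                   psr;

input  = antlr3AsciiFileStreamNew((pANTLR3_UINT8)"data.txt");
lex    = MyLexerNew(input);
tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
                                          TOKENSOURCE(lex));
psr    = MyParserNew(tokens);

psr->syntax(psr);  // memory peaks here, before header_sec is entered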


2008/12/17 Jim Idle <jimi at temporal-wave.com>

> On Tue, 16 Dec 2008 15:54:58 -0800, chain one <chainone at gmail.com> wrote:
>
> > Still waiting for help.
> > I just want to know: is the C runtime target suitable for large input?
>
> Yes.
>
>
> >
> > On 12/16/08, chain one <chainone at gmail.com> wrote:
> >> Hi,
> >> These days I am writing a parser for a kind of data file using C++.
> >> The format of the data file is simple, so the rules are simple.
> >> But when I feed an approximately 20 MB data file to the parser, it
> >> eats more than 600 MB of memory.
> >> I was surprised by this result, and I found that most of the memory
> >> and time were consumed by the lexer.
>
> There is possibly something not quite right with this then.
>
> However, a 20M input file is going to generate a lot of tokens, and
> you need all of the tokens in order to parse the input, hence you are
> using a lot of memory - especially if a lot of your tokens are just a
> few characters long. If all your tokens were one character, you would
> need 20M tokens; that is the worst case, and your case will be
> somewhat less than this.
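>
> To put rough numbers on it (a back-of-the-envelope estimate, assuming
> each buffered token costs somewhere around 50-100 bytes once the token
> structure and its bookkeeping are counted): even 5M tokens would be
> roughly 250-500 MB, which is the order of magnitude you are seeing.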
>
> One way to reduce the number of tokens is to use the SKIP(); macro on
> tokens that you don't need the parser to see, such as ',' or ' '.
> Otherwise they sit in your token stream for no reason. Only mark them
> as hidden and keep them in the token stream if you will need to
> examine them later; otherwise SKIP them.
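>
> For example (a sketch of the two options, assuming a simple whitespace
> rule and that the C target handles the $channel attribute here):
>
> WS : (' '|'\t'|'\r'|'\n')+ { $channel = HIDDEN; } ; // kept in the stream, invisible to the parser
> WS : (' '|'\t'|'\r'|'\n')+ { SKIP(); } ;            // discarded entirely, so it costs no memory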
>
> >>
> >> Is there anything wrong with my grammar, or is it a performance
> >> issue in the ANTLR3 C++ runtime?
> >> I hope there is some way to make my parser more lightweight.
> >>
> >> I attached the .g file to this mail, and the data file (.txt
> >> format) can be
>
> I must have missed the original emails, sorry about that. Can you resend me
> your .g file?
>
> Jim
> >> downloaded from this link:
> >> https://download.yousendit.com/Q01FSU5ONEhZY1IzZUE9PQ
> >>
> >> The test main function is:
> >> ========================================
> >> int main(int argc, char * argv[])
> >> {
>
>
>
> --
> Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
>

