[antlr-interest] [C] my v3 Parser no reuse() slower 20% than v2. With reuse() 2GB leaks, oops.

Wed Nov 16 02:46:23 PST 2011

Hi Jim,

I have spent 2 days running around this, and now I am ready describe what I
see, to get your help, and it seems exists bug/leaks in reuse() area. Or I
not correctly use it, but I do as you described in single letter 3 months
ago.

So ... Long story :-)

* I have simple bench that do 100K INSERT commands.

    v2 parser do this in 19 seconds.
    v3 parser no reuse do this in 24 seconds.

OF COURSE we must expect speedup if to reuse lexer/parser.

So I have design code to be able easy switch between these 2 ways.
And when I try go with reuse I get comparable speed by 2GB of RAM eaten.

=====================================
* Using Apple XCODE 4.2 Instruments, I see what is going on.

   this is not leaks actually, just parser always allocate and allocate
    ANTLR_STRING objects, in parser and tree-parser rules which use

            $c.text

=====================================
FOR EXAMPLE:

* I did have in the parser rule:

    hex_string_literal
        :    s = HEX_NUMBER     -> CONST_STR_HEX[$s.text->chars]
        ;

ZERO my own code here. Right?
And I see that $s.text    in C code expanded to getText() allocates and
allocates ... 
So it is never reused as I understand.

=====================================
BTW

When I have to see that get_Text() is used, and I remember you told avoid
this, 
I have jump to sources and have come to  idea:

    why here to create new token, I need getText() ??
    May be I can just change token type as the following:

hex_string_literal
    :    s = HEX_NUMBER  { $s->setType( $s, CONST_STR_HEX ); }
    ;

And it seems this works fine....

I have correct few rules in such way in the parser....
But Tree Parser  still have for example this:

general_literal returns [ENode_Const_Ptr res]
    : cd=CONST_DATE {res=make_enode_date ( GET_FBL_STRING( $cd.text) );}
    | ct=CONST_TIME {res=make_enode_time ( GET_FBL_STRING( $ct.text) );}
    | s=const_str   {res=make_enode_str  ( GET_FBL_STRING( $s.text ) );}
    ;

All these  $c.text  calls getText() -- this makes COPY of string buffer,
Then I convert into our own FBL_String...

PROBLEM 1:  this ANTLR STRINGs produced by get_Text()  never are reused as I
see.

PROBLEM 2:  related to speed also ‹ how we can avoid here make copy of
string?
     in sources I see that exists code as

    return ((pANTLR3_COMMON_TREE)(tree->super))->token->getText(
                           ((pANTLR3_COMMON_TREE)(tree->super))->token);

May be something can be optimized/hacked here?
For example may be I can write own func, which check what token have
  char* or ANTLR_String, and choose way ...

But what syntax come to token in the .g?
I can do own macro of course ...
Just I want get some feedback if this can be a good idea for all?

=====================================
And this is how I try reuse Lexer/Parser and NOT TreeParser.
All follow to your letter Jim:

void SqlParser_v3::ResuseParserObjects(
    const char*        inTextToParse,
    vuint32            inLength )
{
    // -------------------------------
    // TREE PARSER cannot be reused. Destroy it.
    //
    if( mpTreeParser )
    {
        mpTreeParser->free( mpTreeParser );
        mpTreeParser = NULL;
    }

    if( mpNodes )
    {
        mpNodes->free( mpNodes );
        mpNodes = NULL;
    }    

    // -------------------------------
    // Reuse other objects
    //
    mpInput->reuse(
        mpInput, 
        (pANTLR3_UINT8) inTextToParse,
        (ANTLR3_UINT32) inLength,
        (pANTLR3_UINT8) "VSQL" );

    mpTokenStream->reset( mpTokenStream );
    mpLexer         ->reset( mpLexer );
    mpParser     ->reset( mpParser );

    ResetOwnData( mpParser );
}

-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]