[antlr-interest] v2->v3 Skip chars in Lexer? For C-target [SOLVED 2.5]

Ruslan Zasukhin ruslan_zasukhin at valentina-db.com
Sun Apr 17 05:37:47 PDT 2011


Hi All,

After Jim pointed me to a more efficient way to skip the wrapper quotes,
and after some more time, here is a working solution for the archive:

//--------------------------------------------------------------------
IDENT
    :    ( LETTER | '_' ) ( LETTER | '_' | DIGIT )*
    ;

// RZ 04/17/11: in ANTLR v3 there is no way to skip chars in the lexer. Oops.
//    Instead we use the trick suggested by Jim Idle on the ANTLR list:
//  skip the first/last chars of the token on the parser level.
// 
DELIMITED        // delimited_identifier
    :
    (    DQUOTE ( ~(DQUOTE) | DQUOTE DQUOTE )+ DQUOTE
    |    BQUOTE ( ~(BQUOTE) | BQUOTE BQUOTE )+ BQUOTE
    |    LBRACK ( ~(']') )+ RBRACK
    )    
    ;


On the parser level, we take the Token and move its start/stop pointers with ++ / --.
The type of the Token is also changed to IDENT with the help of a rewrite rule.


//--------------------------------------------------------------------
identifier
    :    IDENT            // regular_identifier
    
    |    d=DELIMITED     // delimited_identifier
        {
            ++$d->start;
            --$d->stop;
        }        
        -> ^( IDENT[$d.text->chars] )
    ;
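
To show the pointer trick outside of a grammar, here is a minimal C sketch.
SketchToken, strip_delimiters and token_length are hypothetical names made up
for this illustration. The struct is NOT the real ANTLR3_COMMON_TOKEN_struct,
it only models the start/stop positions that the action above moves, and it
assumes one byte per character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lexer token: it only records where the
 * token text begins and ends inside the input buffer. This is NOT the
 * real ANTLR3_COMMON_TOKEN_struct. Assumes one byte per character. */
typedef struct {
    const char *start;  /* first character of the token text */
    const char *stop;   /* last character of the token text (inclusive) */
} SketchToken;

/* Drop the opening and closing delimiter by moving both ends inward.
 * No text is copied and nothing is allocated. */
static void strip_delimiters(SketchToken *tok)
{
    /* A delimited token holds at least: open, one body char, close. */
    assert(tok->stop - tok->start >= 2);
    ++tok->start;
    --tok->stop;
}

/* Number of characters currently covered by the token. */
static size_t token_length(const SketchToken *tok)
{
    return (size_t)(tok->stop - tok->start) + 1;
}
```

With a token covering the 8 characters of "my col" including both quotes,
strip_delimiters leaves it covering the 6 characters between them.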



================
Works... But ...
I am far from sure that this solution is really more efficient, Jim.

Yes, on the lexer level I had to use ->chars, and you say it is slower ...

But on the parser level, besides the fast ++ / -- operations, we still need to
create a second token IDENT and copy all values from the first ...

sizeof(ANTLR3_COMMON_TOKEN_struct) is about 160-200 bytes.

So allocating a new token and copying about 150 bytes just to skip TWO chars
does not look like such a cheap operation.  Also note that IDENTs are usually
only 5-20 chars, much less than the 200 bytes of that structure.


So maybe my first solution on the lexer level was not so bad?

And I still have a TODO:  skip chars inside of LITERAL on the parser level ...
    here we cannot just do ++ / --
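
A rough sketch of why ++ / -- is not enough there: removing chars from the
middle of the text changes its length, so the body has to be copied somewhere.
This assumes that LITERAL escapes a quote by doubling it, the same way the
DELIMITED rule does. The LITERAL rule itself is not shown in this letter, so
that is only an assumption. unquote_copy is a hypothetical helper, not part of
the ANTLR C runtime, and it assumes one byte per character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the ANTLR C runtime.
 * Copies the body of a quoted token into `out`, dropping the outer quotes
 * and collapsing each doubled quote into a single one.
 *   src   points at the opening quote
 *   len   is the full token length, including both outer quotes
 *   out   must have room for at least len - 1 bytes
 * Returns the number of characters written, not counting the final NUL. */
static size_t unquote_copy(const char *src, size_t len, char quote, char *out)
{
    size_t n = 0;
    assert(len >= 2 && src[0] == quote && src[len - 1] == quote);
    for (size_t i = 1; i + 1 < len; ++i) {
        out[n++] = src[i];
        /* Inside the body a quote is followed by its twin: skip the twin.
         * The i + 2 < len check keeps the closing quote out of the match. */
        if (src[i] == quote && i + 2 < len && src[i + 1] == quote)
            ++i;
    }
    out[n] = '\0';
    return n;
}
```

For the 7-char token 'it''s' this writes the 4 characters it's into the buffer.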


================
I do not yet see the whole picture of how the lexer works at the low level in C.

I also have not yet seen any clear information about UTF encodings in the C target.
I am going to ask about this in future letters.


-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]



