[antlr-interest] v2->v3 Skip chars in Lexer? For C-target [SOLVED 2.5]

Sun Apr 17 08:37:35 PDT 2011

Why do you have to copy the token? You just pass a pointer to it, and when
you want the text, use the pointers in the token.
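
To illustrate reading token text through pointers instead of copying, here
is a minimal sketch in plain C. It does not use the ANTLR3 runtime:
span_token, strip_wrappers, token_length and token_equals are hypothetical
names invented for this example, and the token is assumed to hold inclusive
start/stop pointers into the input buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a token: it only records where its text
   begins and ends inside the input buffer. The stop pointer is assumed
   to be inclusive. This is NOT the ANTLR3 token struct. */
typedef struct {
    const char *start;  /* first character of the token text */
    const char *stop;   /* last character of the token text  */
} span_token;

/* Narrow the token by one character at each end (drops wrapper quotes). */
static void strip_wrappers(span_token *tok)
{
    assert(tok->stop - tok->start >= 1);  /* needs at least two chars */
    ++tok->start;
    --tok->stop;
}

/* Number of characters the token currently covers. */
static size_t token_length(const span_token *tok)
{
    return (size_t)(tok->stop - tok->start + 1);
}

/* Compare the token text with a C string without copying the token text. */
static int token_equals(const span_token *tok, const char *s)
{
    size_t n = token_length(tok);
    return strlen(s) == n && memcmp(tok->start, s, n) == 0;
}
```

With the input select "my col" from t and a token covering the quoted part,
strip_wrappers leaves the token pointing at my col, and token_equals can
test it in place. Nothing is allocated or copied.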

Your solution is fine, but I don't think it works in all cases involving
fragments, though I cannot remember why just now. There are solutions in
the list archive at antlr.markmail.org.

Jim

> -----Original Message-----
> From: Ruslan Zasukhin [mailto:ruslan_zasukhin at valentina-db.com]
> Sent: Sunday, April 17, 2011 5:38 AM
> To: antlr-interest at antlr.org; Jim Idle
> Subject: Re: [antlr-interest] v2->v3 Skip chars in Lexer? For C-target
> [SOLVED 2.5]
>
> Hi All,
>
> After Jim pointed to a more efficient way to skip the wrapper quotes,
> and some more time, here is a working solution for the archive:
>
> //--------------------------------------------------------------------
> IDENT
>     :    ( LETTER | '_' ) ( LETTER | '_' | DIGIT )*
>     ;
>
> // RZ 04/17/11: in ANTLR v3 there is no way to skip chars in the lexer.
> //    Oops. Instead we use the trick suggested by Jim Idle on the ANTLR
> //    list: skip the first/last chars of the token at the parser level.
> //
> DELIMITED        // delimited_identifier
>     :
>     (    DQUOTE ( ~(DQUOTE) | DQUOTE DQUOTE )+ DQUOTE
>     |    BQUOTE ( ~(BQUOTE) | BQUOTE BQUOTE )+ BQUOTE
>     |    LBRACK ( ~(']') )+ RBRACK
>     )
>     ;
>
>
> At the parser level, we apply ++ / -- to the Token's start/stop
> pointers. The type of the Token is also changed to IDENT with the help
> of a rewrite.
>
>
> //--------------------------------------------------------------------
> identifier
>     :    IDENT            // regular_identifier
>
>     |    d=DELIMITED     // delimited_identifier
>         {
>             ++$d->start;
>             --$d->stop;
>         }
>         -> ^( IDENT[$d.text->chars] )
>     ;
>
>
>
> ================
> Works... But ...
> I am far from sure that this solution is really more efficient, Jim.
>
> Yes, at the lexer level I have used ->chars, and you say it is slower ...
>
> But at the parser level, besides the fast ++ / -- operations, we still
> need to create a second IDENT token and copy all values from the first ...
>
> sizeof(ANTLR3_COMMON_TOKEN_struct) is about 160-200 bytes.
>
> So allocating a new token and copying about 150 bytes just to skip TWO
> chars does not look like a cheap operation. Also note that IDENTs are
> usually only 5-20 chars, much less than the 200 bytes of that structure.
>
>
> So maybe my first solution at the lexer level was not so bad?
>
> And I still have a TODO: skip chars inside a LITERAL at the parser
> level ...
>     here we cannot just do ++ / --
>
>
> ================
> I do not yet see the whole picture of how the lexer works at the low
> level in C.
>
> I also have not yet found any clear information about UTF encodings in
> the C target.
> I am going to ask about this in future letters.
>
>
> --
> Best regards,
>
> Ruslan Zasukhin
> VP Engineering and New Technology
> Paradigma Software, Inc
>
> Valentina - Joining Worlds of Information http://www.paradigmasoft.com
>
> [I feel the need: the need for speed]
>
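
The TODO in the quoted message (characters inside the text that cannot be
dropped with ++ / --) applies to doubled quotes such as "" inside a
delimited identifier. One possible approach, sketched here in plain C, is a
single copying pass that collapses each doubled quote. unescape_doubled is
a hypothetical helper written for this example, not part of the ANTLR3
runtime, and it assumes the wrapper quotes have already been excluded from
the range it is given.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the body of a delimited identifier into out, turning each doubled
   quote character (for example "") into a single one.
   src/len must already exclude the wrapper quotes.
   out must have room for at least len + 1 bytes.
   Returns the number of characters written, not counting the final NUL. */
static size_t unescape_doubled(const char *src, size_t len,
                               char quote, char *out)
{
    size_t i = 0;  /* read position in src   */
    size_t n = 0;  /* write position in out  */

    assert(src != NULL && out != NULL);

    while (i < len) {
        out[n++] = src[i];
        /* A quote followed by the same quote counts as one character. */
        if (src[i] == quote && i + 1 < len && src[i + 1] == quote)
            i += 2;
        else
            i += 1;
    }
    out[n] = '\0';
    return n;
}
```

Unlike the pointer adjustment, this needs a destination buffer, so it only
pays off for tokens that actually contain a doubled quote. Tokens without
one can keep using the start/stop pointers unchanged.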