[antlr-interest] [C] code to change Token type, use char* and loose data when buffer destroyed

Jim Idle jimi at temporal-wave.com
Wed Sep 28 08:49:04 PDT 2011


You can of course process things anywhere that does not cause ambiguity,
but the best approach is to defer any processing until the last possible
point, so that you never process anything that you turn out not to need.
The second 'rule' is that you only want to process things once, so process
and cache the result for later.

If you can modify the input stream, then you don't need to copy anything
here, just move the start and end pointers in the token and overwrite the
few bytes that you are moving. That way there is no malloc and nothing to
free. If you cannot modify the input stream, then you will need to copy
from the token pointers of course.
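
As a rough illustration of the in-place case, here is a minimal sketch. The
TokenSpan struct and unescape_in_place() are hypothetical names, not part of
the ANTLR C runtime; they only stand in for "a token that holds start and end
pointers into a writable input buffer". The sketch assumes a well-formed
token that still includes both surrounding quotes, and it handles only the
'' and \' forms discussed in this thread.

```c
#include <stddef.h>

/* Hypothetical token view: two pointers into a writable input buffer. */
typedef struct {
    char *start;  /* first character of the token text      */
    char *end;    /* one past the last character of the text */
} TokenSpan;

/*
 * Drop the surrounding quotes and collapse '' and \' to a single '
 * by shifting bytes left inside the input buffer itself.
 * No malloc, nothing to free; only the token pointers move.
 */
static void unescape_in_place(TokenSpan *t)
{
    char *src = t->start + 1;  /* skip the opening quote       */
    char *lim = t->end - 1;    /* stop before the closing quote */
    char *dst = src;

    t->start = dst;            /* text now starts after the opening quote */
    while (src < lim) {
        /* An escape char followed by a quote: skip the escape char. */
        if ((*src == '\'' || *src == '\\') && src + 1 < lim && src[1] == '\'') {
            src++;
        }
        *dst++ = *src++;
    }
    t->end = dst;              /* text is shorter or equal, never longer */
}
```

Because the processed text is never longer than the raw text, the write
pointer never overtakes the read pointer, which is what makes the overwrite
safe.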

So, here you should lex the escape characters and the embedded '' into
STRING_LITERAL but not try to process the WS* there; return two or more
tokens instead. Then the parser or tree parser can process the strings. If
you are going to do multiple walks, then probably process them in the
parser, but if there is just one walk (or only one walk where you care about
the text represented by the tokens), then process them in the tree parser
when you hit the STRING_LITERAL+
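
To make the deferred, cached processing concrete, here is a minimal sketch
for the case where the input stream cannot be modified. Span, StringNode and
string_node_text() are hypothetical names for illustration only, not ANTLR C
runtime types; a StringNode stands in for whatever the parser or tree parser
holds when it has matched STRING_LITERAL+. Each span is assumed to be a
well-formed literal including its quotes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical read-only view of one STRING_LITERAL token. */
typedef struct {
    const char *start;  /* points at the opening quote        */
    const char *end;    /* one past the closing quote          */
} Span;

/* Hypothetical node for STRING_LITERAL+ with a lazily built text. */
typedef struct {
    const Span *parts;  /* the adjacent literals, in order     */
    size_t      count;
    char       *cached; /* NULL until the text is first needed */
} StringNode;

/* Copy one literal without its quotes, collapsing '' and \' to '. */
static size_t unescape_copy(char *dst, const Span *s)
{
    const char *src = s->start + 1;
    const char *lim = s->end - 1;
    char *out = dst;

    while (src < lim) {
        if ((*src == '\'' || *src == '\\') && src + 1 < lim && src[1] == '\'') {
            src++;
        }
        *out++ = *src++;
    }
    return (size_t)(out - dst);
}

/* Build the concatenated text on first request, then reuse it. */
static const char *string_node_text(StringNode *n)
{
    size_t cap = 1;  /* room for the terminating NUL */
    size_t i;
    char *buf, *dst;

    if (n->cached != NULL) {
        return n->cached;  /* already processed once */
    }
    for (i = 0; i < n->count; i++) {
        cap += (size_t)(n->parts[i].end - n->parts[i].start);
    }
    buf = malloc(cap);  /* raw length is an upper bound on the result */
    if (buf == NULL) {
        return NULL;
    }
    dst = buf;
    for (i = 0; i < n->count; i++) {
        dst += unescape_copy(dst, &n->parts[i]);
    }
    *dst = '\0';
    n->cached = buf;
    return buf;
}
```

Nothing is allocated for string literals whose text is never asked for, and
a literal that is asked for several times is only processed once. The owner
of the node is responsible for freeing the cached buffer.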

Jim

> -----Original Message-----
> From: Ruslan Zasukhin [mailto:ruslan_zasukhin at valentina-db.com]
> Sent: Tuesday, September 27, 2011 11:41 PM
> To: antlr-interest at antlr.org; Jim Idle
> Subject: Re: [antlr-interest] [C] code to change Token type, use char*
> and loose data when buffer destroyed
>
> Hi Jim,
>
> What do you think about the idea of resolving everything at the LEXER
> level?
>
> So we must resolve tokens as
>
> * STRING_LITERAL          'aa'
> * STRING_LITERAL          'aa' ws* 'bb'     => Token( "aabb" )
>
> * STRING_LITERAL          'aa\'bb'          => Token( "aa'bb" )
> * STRING_LITERAL          'aa''bb'           => Token( "aa'bb" )
> * STRING_LITERAL          'aa''bb''cc'      => Token( "aa'bb'cc" )
>
> * HEX_LITERAL              x'aa'                  => Token( "aa" )
> * HEX_LITERAL              x'aa' ws* 'bb'     => Token( "aabb" )
>
>
> Do you think we can do this in [C] without copying buffers?
> I think not.
>
> Then the question is:
>     how can this be solved using minimal copies?
>
> Or do you think it is really better to use the
>     Lexer -> Parser -> TreeParser combination?
>
>
> On 9/28/11 1:34 AM, "Ruslan Zasukhin" <ruslan_zasukhin at valentina-db.com> wrote:
>
> > On 9/28/11 12:46 AM, "Douglas Godfrey" <douglasgodfrey at gmail.com> wrote:
> >
> > Hi Douglas,
> >
> > Yes, I have thought about this way also.
> >
> > But in your solution you use helper functions as
> >     RemoveQuotePairs()
> >
> > Which, I guess, do some copying into additional RAM buffers.
> > This is fine for Java guys, but in C code, as Jim likes to underline each
> > time, we tend to use only pointers to the input buffer, as long as possible.
> >
> >
> >> You need to modify your string lexing rules to use sub-rules for the
> >> elementary strings and return the concatenated string as the lexer
> >> token value.
> >>
> >> The value of
> >>
> >> StringConstant: QuotedString
> >> {RemoveQuotePairs($QuotedString);};
> >>
> >> fragment
> >> QuotedString:  ( StringTerm )+;
> >>
> >> fragment
> >> StringTerm:  Dquote ( Character )* Dquote;
> >>
> >> fragment
> >> Character: ( ' ' | AlphaChar | Punctuation | Digit );
>
> --
> Best regards,
>
> Ruslan Zasukhin
> VP Engineering and New Technology
> Paradigma Software, Inc
>
> Valentina - Joining Worlds of Information http://www.paradigmasoft.com
>
> [I feel the need: the need for speed]
>

