[antlr-interest] How do I preserve comments in a language to language translator

Wed Aug 11 14:50:45 PDT 2010

I would try to track the comments as early in the translation as possible:
the further you get from the initial input, the more difficult it is to
associate the original tokens with the transformed structure.

So, when you produce your initial tree, choose the set of tokens (which may
include imaginary nodes) that it makes sense to associate comments with. When
the parser hits the start (or end) of one of these constructs, call an
external method to track back and find the token indexes of the first and
last comment tokens. Do the rewrite, then in action code adorn the AST node
with the token index information. Again, you want to do such things in
external methods that accept, say, the token index to search from and the
token pointer:

func
@declarations
{
 ANTLR3_MARKER firstTok;
}
@init
{
  firstTok = INDEX();
}
: ( type ID LPAREN list RPAREN
      ->^(CFUNC ID type list)
  )
  { findComments($tree, firstTok); /* Uses the node token in the current
       tree, looks back from firstTok, stores in user1, user2 */ }
;
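
As a rough idea of what findComments could do, here is a minimal sketch in
plain C. It does not use the ANTLR runtime: Token, the channel constants,
NO_COMMENT and find_comments are hypothetical stand-ins for illustration. A
real version would fetch tokens from the token stream with get(n) and write
to user1/user2 of the node's pANTLR3_COMMON_TOKEN. It assumes comments are on
channel 2 and whitespace is on HIDDEN (99).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a token. The real type is pANTLR3_COMMON_TOKEN;
 * its user fields are plain integers, so a real version needs its own
 * "no comment" sentinel instead of -1. */
typedef struct {
    int  channel; /* channel the lexer put the token on */
    long user1;   /* index of the first leading comment, or NO_COMMENT */
    long user2;   /* index of the last leading comment, or NO_COMMENT */
} Token;

enum { DEFAULT_CHANNEL = 0, COMMENT_CHANNEL = 2, HIDDEN_CHANNEL = 99 };
enum { NO_COMMENT = -1 };

/* Walk backwards from first_tok (the index recorded in @init), skipping
 * whitespace on HIDDEN, and record the run of comment tokens that precedes
 * the construct. The walk stops at the first on-channel token. */
static void find_comments(const Token *stream, size_t count,
                          size_t first_tok, Token *node_tok)
{
    long first = NO_COMMENT, last = NO_COMMENT;

    assert(stream != NULL && node_tok != NULL && first_tok <= count);

    for (size_t i = first_tok; i-- > 0; ) {
        int ch = stream[i].channel;
        if (ch == COMMENT_CHANNEL) {
            if (last == NO_COMMENT)
                last = (long)i;   /* nearest comment is the last one */
            first = (long)i;      /* keeps moving back to the first one */
        } else if (ch != HIDDEN_CHANNEL) {
            break;                /* real code from an earlier construct */
        }
    }
    node_tok->user1 = first;
    node_tok->user2 = last;
}
```

With a stream of COMMENT, WS, COMMENT, WS, COMMENT, WS, ID, ... and
first_tok = 6, this leaves 0 in user1 and 4 in user2.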

Some time later, in your code-generating tree walk, you will hit the CFUNC
node. You may have done all sorts of manipulations of the child nodes by
then, but the user1 and user2 fields of the CFUNC token in the tree will
still contain the start and end indexes of the comments you wanted. Call an
external function with a pointer to the tree node and have that function copy
the text from all the tokens in that range.

func :
  ^(c=CFUNC ID type list)
   { copyComments($c); 
      genFunc(.....
  }
;
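
The copying step might look something like the sketch below. Again this is
plain C with made-up stand-ins (Token, COMMENT_CHANNEL, copy_comments), not
the runtime API. The only things taken from the real token are the ideas that
it has a channel and start/stop pointers to the first and last character of
its text. Writing into a fixed buffer is just an assumption to keep the
example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the parts of a token used here. */
typedef struct {
    int         channel;
    const char *start;  /* first character of the token text */
    const char *stop;   /* last character of the token text (inclusive) */
} Token;

enum { COMMENT_CHANNEL = 2 };

/* Append the text of every comment token in [first, last] to out, one per
 * line. Anything else in the range (whitespace, say) is skipped. Returns the
 * number of characters written, not counting the terminating NUL, and stops
 * early if the next comment would not fit. */
static size_t copy_comments(const Token *stream, size_t first, size_t last,
                            char *out, size_t out_size)
{
    size_t used = 0;

    assert(stream != NULL && out != NULL && out_size > 0);

    for (size_t i = first; i <= last; i++) {
        const Token *t = &stream[i];
        if (t->channel != COMMENT_CHANNEL)
            continue;

        size_t len = (size_t)(t->stop - t->start) + 1; /* stop is inclusive */
        if (used + len + 1 >= out_size)  /* text + '\n' + NUL must fit */
            break;

        memcpy(out + used, t->start, len);
        used += len;
        out[used++] = '\n';
    }
    out[used] = '\0';
    return used;
}
```

In the tree walk you would call it with the user1 and user2 values stored on
the CFUNC token as first and last.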

The mechanics in the tree parser are that $c comes in as a
pANTLR3_BASE_TREE, which has a member called super, a void * that you cast
to pANTLR3_COMMON_TREE. pANTLR3_BASE_TREE has a method getToken(), which
returns the payload token for the node as a pANTLR3_COMMON_TOKEN. The token
has user1, user2 and user3 ints, ANTLR3_MARKER start and stop positions that
point to the first and last characters of its text, and a channel and an
index. Using the original (tokenized) input stream you can call get(n) to
fetch the token at position n and then use its text pointers.

If the comments follow the structure, then the same things apply, but now you
find the comments at the end and, in code gen, spit them out at the end.

You can see that there will be ambiguous situations where it is difficult to
know if the comment is for the end of some statement or the start of the
next, so you will have to decide consistently and live with it ;-)

// Call stat
stat(); // We call stat to fubar
// now stat is called, x will = 99 and b points to bananas

// Call flick() to do cyz
//
flick();
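
One consistent rule you could pick, purely as an example: a comment on the
same line as the preceding real token belongs to that token, and anything
else belongs to the next real token. The sketch below uses hypothetical
stand-ins (Token, NO_OWNER, comment_owner) and assumes each token carries its
line number, as ANTLR tokens do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of a token used here. */
typedef struct {
    int      channel;
    unsigned line;    /* source line the token starts on */
} Token;

enum { DEFAULT_CHANNEL = 0, COMMENT_CHANNEL = 2 };
#define NO_OWNER ((size_t)-1)

/* Return the index of the on-channel token that the comment at comment_idx
 * is attached to, or NO_OWNER if the stream has no on-channel tokens. */
static size_t comment_owner(const Token *stream, size_t count,
                            size_t comment_idx)
{
    size_t prev = NO_OWNER, next = NO_OWNER;

    assert(comment_idx < count);
    assert(stream[comment_idx].channel == COMMENT_CHANNEL);

    /* Nearest on-channel token before the comment. */
    for (size_t i = comment_idx; i-- > 0; ) {
        if (stream[i].channel == DEFAULT_CHANNEL) { prev = i; break; }
    }
    /* Nearest on-channel token after the comment. */
    for (size_t i = comment_idx + 1; i < count; i++) {
        if (stream[i].channel == DEFAULT_CHANNEL) { next = i; break; }
    }

    /* Same line as the previous token: treat it as a trailing comment. */
    if (prev != NO_OWNER && stream[prev].line == stream[comment_idx].line)
        return prev;
    /* Otherwise it leads the next token; at end of input fall back to prev. */
    return next != NO_OWNER ? next : prev;
}
```

Run against the example above, "// We call stat to fubar" goes with stat();
but "// now stat is called, ..." ends up attached to flick();, which is
exactly the kind of wrong-but-consistent guess you have to live with.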

Anyway, those are the main structures you need, and it is pretty easy once
you get your head around the method calling. Use the runtime source code as
your guide for finding things out about the structures, along with the API
doxygen linked from the home page.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Howard Nasgaard
> Sent: Wednesday, August 11, 2010 2:04 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] How do I preserve comments in a language to
> language translator
> 
> Jim,  It sounds like you understand what I need to do.  I will be happy
> with a 'best guess' approach.  What you describe is basically what I
> understood from the reading I've done.  What I think I am still missing is
> an example of the mechanics of doing this in the tree grammar.  Is it a
> matter of inserting code in each rule to examine each token, looking for
> comment nodes (assume a unique channel for those).  Would you track the
> index of the last node checked so that you can get the range of tokens to
> examine?  It sounds like this could get a bit messy.
> 
> Howard W. Nasgaard
> 
> 
> 
> antlr-interest-bounces at antlr.org wrote on 11/08/2010 04:46:12 PM:
> 
> > Re: [antlr-interest] How do I preserve comments in a language to
> > language translator
> > Jim Idle to: antlr-interest, 11/08/2010 04:48 PM
> > Sent by: antlr-interest-bounces at antlr.org
> >
> > This is a very tricky thing to do perfectly, but not so difficult to do
> > as a 'best guess' type of algorithm. For instance if the comments are
> > found before certain tokens and can be just pushed to the output before
> > the translated version (like doxygen comments or javadoc etc), or if
> > 'comments close by' is a reasonable guess. It is difficult to speak to
> > your problem generically, but some translations make this easy enough
> > and some very difficult.
> >
> > However, what you will need to do is locate the token that 'starts'
> > your construct output, then find its equivalent token position in the
> > original tokenized input stream. If the token in the tree is from the
> > original input stream then it is easy, otherwise you can use the user1,
> > user2, user3 fields of a token to record the token that 'starts' the
> > code you have translated or perhaps the start and end tokens that are
> > the comment block.
> >
> > Now, knowing the input token position, you can traverse backwards in
> > the token stream (use get and not LT as LT skips off channel tokens)
> > and find the first of the comment tokens that precedes it (by checking
> > the token's channel). This will be easier if you set the comments to a
> > particular channel and not just HIDDEN (which is channel 99). When you
> > know the token position of the comment token, then you can traverse
> > forwards and copy the token text to the output (changing the comment
> > lead-in characters should you need to) using the pointers available in
> > the token (which point to the original text).
> >
> > So, you just need to get familiar with asking the tree nodes for their
> > tokens and then asking the tokens what index they are and using the
> > get methods to access the tokens in the input stream.
> >
> > So:
> >
> > // A comment
> > // Another
> > // yet another
> > int Cfunc( ....
> >
> > So, if the comments are going on channel 2 then you will have:
> >
> > 0 COMMENT
> > 1 COMMENT
> > 2 COMMENT
> > 3 ID
> > 4 ID
> > 5 LPAREN
> >
> > Now, your first parser is probably going to generate
> > ^(FUNCDECL ID ID .....)
> >
> > You can now attach the index of the first comment (0) to user1 and
> > the index of the last comment to user2 of say FUNCDECL, or the first ID.
> > Assuming that the token is preserved through all the rewrites, then
> > this information will propagate to your final AST.
> >
> > Of course this is just illustrating what you need to do generally as I
> > do not know exactly what you are trying to do.
> >
> > Jim
> >
> > > -----Original Message-----
> > > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> > > bounces at antlr.org] On Behalf Of Howard Nasgaard
> > > Sent: Wednesday, August 11, 2010 1:13 PM
> > > To: antlr-interest at antlr.org
> > > Subject: [antlr-interest] How do I preserve comments in a language
> > > to language translator
> > >
> > > I am writing a translator that will convert from one version of a
> > > language to a newer version of that language.  The versions are
> > > syntactically similar so their underlying ASTs are similar.  I am
> > > using parsers for the grammar and tree grammars generated as C++.  The
> > > old language is parsed and an AST is built.  Then numerous walks of
> > > the AST are done using generated tree grammars.  One of the walks
> > > creates a new AST, the translation, which conforms to the tree
> > > hierarchy that describes the new language elements.  A final walk of
> > > the new AST "pretty prints" the translation.
> > >
> > > As part of the translation walk, or whatever works, I would like to
> > > copy as many of the comment tokens across to the new AST as possible.
> > > Based on my reading, the comments are there as they are being directed
> > > to the HIDDEN channel.  It is just not clear how, in my tree grammar,
> > > I would access them.  I have been unable to find any descriptions of
> > > how to do this that apply to antlr3 and C++.
> > >
> > > Howard W. Nasgaard
> > >
> > > List: http://www.antlr.org/mailman/listinfo/antlr-interest
> > > Unsubscribe:
> > > http://www.antlr.org/mailman/options/antlr-interest/your-
> > > email-address