[antlr-interest] How do I preserve comments in a language to language translator

Jim Idle jimi at temporal-wave.com
Wed Aug 11 13:46:12 PDT 2010


This is a very tricky thing to do perfectly, but not so difficult to do as a
'best guess' type of algorithm, for instance if the comments are found
before certain tokens and can just be pushed to the output before the
translated version (like doxygen or javadoc comments), or if 'comments
close by' is a reasonable guess. It is difficult to speak to your problem
generically, but some translations make this easy enough and some make it
very difficult.

However, what you will need to do is locate the token that 'starts' your
construct output, then find its equivalent token position in the original
tokenized input stream. If the token in the tree is from the original input
stream then it is easy, otherwise you can use the user1, user2, user3 fields
of a token to record the token that 'starts' the code you have translated or
perhaps the start and end tokens that are the comment block. 

Now, knowing the input token position, you can traverse backwards in the
token stream (use get and not LT as LT skips off channel tokens) and find
the first of the comment tokens that precedes it (by checking the token's
channel). This will be easier if you set the comments to a particular
channel and not just HIDDEN (which is channel 99). When you know the token
position of the comment token, then you can traverse forwards and copy the
token text to the output (changing the comment lead-in characters should you
need to) using the pointers available in the token (which point to the
original text). 
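
The backward-then-forward walk described above can be sketched as follows.
This is only an illustration under stated assumptions: Token, the channel
constants and both function names are hypothetical stand-ins, not the ANTLR
runtime types, and a std::vector stands in for the token stream's get()
access. Comments are assumed to be on channel 2 and whitespace on HIDDEN.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a lexer token; NOT the ANTLR runtime type.
struct Token {
    int channel;       // 0 = default, 2 = comments (assumed), 99 = HIDDEN
    std::string text;
};

constexpr int DEFAULT_CHANNEL = 0;
constexpr int COMMENT_CHANNEL = 2;

// Walk backwards from `start` over off-channel tokens and return the index
// of the earliest comment found before the previous on-channel token.
// Returns `start` itself when no comment precedes it.
std::size_t firstPrecedingComment(const std::vector<Token>& tokens,
                                  std::size_t start) {
    std::size_t first = start;
    for (std::size_t i = start; i > 0; --i) {
        const Token& t = tokens[i - 1];
        if (t.channel == DEFAULT_CHANNEL) break;          // real code: stop
        if (t.channel == COMMENT_CHANNEL) first = i - 1;  // remember comment
    }
    return first;
}

// Walk forwards from the first comment up to `start`, copying comment text
// (one comment per line) and skipping anything else that is off channel.
std::string collectComments(const std::vector<Token>& tokens,
                            std::size_t start) {
    std::string out;
    for (std::size_t i = firstPrecedingComment(tokens, start); i < start; ++i) {
        if (tokens[i].channel == COMMENT_CHANNEL) {
            out += tokens[i].text;
            out += '\n';
        }
    }
    return out;
}
```

Giving comments their own channel is what keeps the check a simple
comparison here; if they shared HIDDEN with whitespace, the walk would have
to look at token types as well.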

So, you just need to get familiar with asking the tree nodes for their
tokens and then asking the tokens what index they are and using the get
methods to access the tokens in the input stream.

So:

// A comment
// Another
// yet another
int Cfunc( ....

So, if the comments are going on channel 2 then you will have:

0 COMMENT 
1 COMMENT 
2 COMMENT 
3 ID 
4 ID 
5 LPAREN 

Now, your first parser is probably going to generate ^(FUNCDECL ID ID .....)

You can now attach the index of the first comment (0) to user1 and the
index of the last comment (2) to user2 of, say, FUNCDECL or the first ID.
Assuming that the token is preserved through all the rewrites, this
information will propagate to your final AST.
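
A minimal sketch of recording the comment range on a token and reading it
back in a later pass, using the example stream above. Everything here is an
assumption for illustration: Token is a hypothetical stand-in whose user1
and user2 members only mimic the idea of the runtime's user fields, -1 is an
arbitrary 'no comments' marker, and the function names are made up.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a token with user fields; NOT the ANTLR type.
struct Token {
    int channel;
    std::string text;
    long user1;  // index of first attached comment, -1 if none (assumed)
    long user2;  // index of last attached comment, -1 if none (assumed)

    Token(int c, std::string t)
        : channel(c), text(std::move(t)), user1(-1), user2(-1) {}
};

constexpr int COMMENT_CHANNEL = 2;  // assumed comment channel

// First pass: note the run of comments directly before tokens[index] in
// that token's user fields, so the information travels with the token.
void recordLeadingComments(std::vector<Token>& tokens, std::size_t index) {
    std::size_t first = index;
    while (first > 0 && tokens[first - 1].channel == COMMENT_CHANNEL) {
        --first;
    }
    if (first < index) {
        tokens[index].user1 = static_cast<long>(first);
        tokens[index].user2 = static_cast<long>(index - 1);
    }
}

// Later pass (e.g. the pretty printer): fetch the saved comments for the
// token that a tree node carries.
std::string leadingComments(const std::vector<Token>& tokens,
                            const Token& tok) {
    std::string out;
    if (tok.user1 < 0) return out;
    for (long i = tok.user1; i <= tok.user2; ++i) {
        out += tokens[static_cast<std::size_t>(i)].text;
        out += '\n';
    }
    return out;
}
```

For the stream above, calling recordLeadingComments on index 3 (the first
ID) stores 0 and 2, and the printer can then emit the three comment lines
before the translated declaration.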

Of course this is just illustrating what you need to do generally as I do
not know exactly what you are trying to do.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Howard Nasgaard
> Sent: Wednesday, August 11, 2010 1:13 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] How do I preserve comments in a language to
> language translator
> 
> I am writing a translator that will convert from one version of a
> language to a newer version of that language.  The versions are
> syntactically similar so their underlying ASTs are similar.  I am using
> parsers for the grammar and tree grammars generated as C++.  The old
> language is parsed and an AST is built.  Then numerous walks of the AST
> are done using generated tree grammars.  One of the walks creates a new
> AST, the translation, which conforms to the tree hierarchy that describes
> the new language elements.  A final walk of the new AST "pretty prints"
> the translation.
> 
> As part of the translation walk, or whatever works, I would like to copy
> as many of the comment tokens across to the new AST as possible.  Based
> on my reading, the comments are there as they are being directed to the
> HIDDEN channel.  It is just not clear how, in my tree grammar, I would
> access them.  I have been unable to find any descriptions of how to do
> this that apply to antlr3 and C++.
> 
> Howard W. Nasgaard
> 
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address


