[antlr-interest] Trying to keep whitespace in an AST

Fri Feb 8 10:22:37 PST 2008

Well, remember that the AST is, err abstract ;-). It is just a construct 
made from the token stream that you parsed. The parser skips tokens that 
you create "off-channel", such as comments:

COMMENT: '//' ~NL*  { $channel = 2; } ;

Now, when you walk your AST and find a method, you just need the token 
index of the start sequence of your method declaration (this of course 
depends on the language). Then you can traverse backwards in the token 
stream (the stream you passed to the parser, mostly CommonTokenStream) 
for that index, and pick up any off-channel tokens that were ignored by 
the parser. If your common token stream is called tstream, then:

tstream.get(index) will return the token at that index, whether it is on 
the parsing channel or not. There is also tstream.getRange(...), which 
will return a List of the tokens in a range, whether on channel or off 
channel.

So, you hit the 'method' keyword/node/token and find out its index (or 
the index of a real token rather than an imaginary one perhaps). Then 
you traverse back through the stream until some trigger point such as 
the first on-channel token before the comments or something. Only you 
can know exactly where you start and stop, and the problem of 
associating comments with the correct syntactical element is a thorny 
one!
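
As a rough sketch of that backwards walk: the code below uses a plain
List as a stand-in for the token stream, and the Tok class and the
before() helper are made up for illustration, not part of ANTLR. With
the real runtime you would call tstream.get(i) and check the token's
getChannel() instead. It assumes the trigger point is simply the first
on-channel token found when walking back, which may not suit every
language.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an ANTLR token: only the text and the channel.
final class Tok {
    static final int DEFAULT_CHANNEL = 0; // tokens the parser sees
    static final int COMMENT_CHANNEL = 2; // same value as $channel = 2 above

    final String text;
    final int channel;

    Tok(String text, int channel) {
        this.text = text;
        this.channel = channel;
    }
}

final class LeadingComments {
    // Walk backwards from the token just before startIndex and collect
    // off-channel tokens, stopping at the first on-channel token.
    // The collected tokens are returned in source order.
    static List<Tok> before(List<Tok> tokens, int startIndex) {
        List<Tok> found = new ArrayList<>();
        for (int i = startIndex - 1; i >= 0; i--) {
            Tok t = tokens.get(i);
            if (t.channel == Tok.DEFAULT_CHANNEL) {
                break; // trigger point: a token the parser actually saw
            }
            found.add(t);
        }
        Collections.reverse(found);
        return found;
    }

    public static void main(String[] args) {
        // Token sequence for:  ... ;  // reassign a  a = ...
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok(";", Tok.DEFAULT_CHANNEL));
        tokens.add(new Tok("// reassign a", Tok.COMMENT_CHANNEL));
        tokens.add(new Tok("a", Tok.DEFAULT_CHANNEL));
        tokens.add(new Tok("=", Tok.DEFAULT_CHANNEL));

        // Index 2 is the first real token of the assignment statement.
        for (Tok t : before(tokens, 2)) {
            System.out.println(t.text);
        }
    }
}
```

If whitespace is also sent off-channel, the same loop picks it up, so
blank lines between statements can be recovered the same way.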

Jim

> -----Original Message-----
> From: Jamie Penney [mailto:jpen054 at ec.auckland.ac.nz]
> Sent: Thursday, February 07, 2008 7:51 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Trying to keep whitespace in an AST
> 
> Hi all,
> I am trying to work out how to create a grammar that will build an AST
> that keeps both comments and some whitespace. Basically the output will
> be formatted code, but we need the semantic information provided by the
> AST for other parts of the system. Any comments and blank lines need to
> be kept in the output code. Is it possible to have rewriting and AST
> generation turned on at the same time, or do I have to write two
> separate grammars? I am new to ANTLR so sorry if I have the wrong idea
> about anything.
> To give a concrete example, say I have a language that represents basic
> C style statements like so:
> 
> int a    = 0;
> int b    = 1;
> int c    = 2;
> 
> // reassign a
> a = b + c;
> 
> What I need is the semantic information provided by an AST (whether a
> statement is a declaration, assignment, etc.), but I need to transform
> the language partially too. I need to format the individual elements
> consistently, so each would be of the form a = b + c; but I also need to
> retain the newlines and comments between elements.
> 
> If anyone could point me in the right direction I would be very
> grateful.
> 
> Thanks,
> Jamie Penney