[antlr-interest] Parsing documentation comments (with nesting!) (v3)

Wed Feb 21 19:36:59 PST 2007

I'm using ANTLR v3b6.

I've been working on a tool to create a symbol database for the D
programming language. This means that I don't need a complete parser,
just enough of one to identify a few "global" symbol definitions. I'm
doing okay with some language basics, but I'm running into trouble
parsing comments. I have a couple of big questions.

If you're unfamiliar, D is a programming language that looks a lot  
like C++ and Java. In particular, it has multiline comments delimited  
by '/*' and '*/'. It has "to-EOL" comments that start with '//' and  
go to the end of the line.

It also has nesting multiline comments. You can delimit a comment  
with '/+' and '+/', and nest these arbitrarily deeply.
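
To make the nesting requirement concrete, here is a minimal hand-written
sketch in Java (not an ANTLR grammar) of the behavior I'd expect from a
lexer rule for '/+' ... '+/'. The class and method names are made up for
illustration; the only idea is a depth counter that goes up on '/+' and
down on '+/'.

```java
/** Hypothetical helper for illustration; not part of ANTLR or any D tool. */
final class NestingCommentScanner {
    private NestingCommentScanner() {}

    /**
     * Returns the index just past the "+/" that closes the nesting comment
     * opening at {@code start}, or -1 if {@code start} does not begin a
     * "/+" comment or the comment is never closed.
     */
    static int endOfNestingComment(String src, int start) {
        if (!src.startsWith("/+", start)) {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i < src.length()) {
            if (src.startsWith("/+", i)) {
                depth++;          // entered one more level of nesting
                i += 2;
            } else if (src.startsWith("+/", i)) {
                depth--;          // closed one level
                i += 2;
                if (depth == 0) {
                    return i;     // the outermost comment just ended
                }
            } else {
                i++;              // ordinary comment character
            }
        }
        return -1;                // ran off the end: unterminated comment
    }
}
```

For the input "/+ a /+ b +/ c +/ int x;" the call
endOfNestingComment(src, 0) should return the index of the space before
"int", so the inner "+/" does not end the comment early.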

A variant of each of these three denotes a Documentation Comment. If  
a comment starts with '/**', '/++' or '///', it is considered  
documentation, and applies to the symbols defined "nearby" (the  
specific rules are not important). The comment itself has a structure  
that would be nice to include in the overall grammar.
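
As a rough illustration of the classification described above, the
following Java sketch sorts a comment by its opening characters. It
follows only the three prefixes named in this post and makes no attempt
at D's corner cases (an empty '/**/', for instance); the names are
hypothetical.

```java
/** Hypothetical classifier for illustration; prefix rules are taken from the post only. */
final class CommentKinds {
    private CommentKinds() {}

    enum Kind { BLOCK, NESTING, LINE, NOT_A_COMMENT }

    /** Classifies text by its two-character opener. */
    static Kind kindOf(String text) {
        if (text.startsWith("/*")) return Kind.BLOCK;    // '/*' ... '*/'
        if (text.startsWith("/+")) return Kind.NESTING;  // '/+' ... '+/'
        if (text.startsWith("//")) return Kind.LINE;     // to end of line
        return Kind.NOT_A_COMMENT;
    }

    /** True if the comment opens with one of the documentation prefixes. */
    static boolean isDocComment(String text) {
        return text.startsWith("/**")
            || text.startsWith("/++")
            || text.startsWith("///");
    }
}
```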

At the most basic level, I'd like to be able to get at the content of  
a regular multiline comment. The beta book shows an example like this:

COMMENT
     :    '/*' ( options {greedy=false;} : . )* '*/'
     ;

I've tried this, and it works fine, but I can't get at the text of
the comment. I tried labeling the subrule, but ANTLR rejected that.
So I tried this:

COMMENT
     :    '/*'! COMMENTTEXT '*/'!
          { System.out.println("Found a comment [" + $COMMENTTEXT.text + "]"); }
     ;

fragment
COMMENTTEXT
options
{
     greedy = false;
}
     :    .*
     ;

But I get "The following alternatives are unreachable: 1".

(Keep in mind that my grammar will eventually generate an AST, but right
now it has embedded code to help me debug and learn.)

I'd like to parse the structure of the Doc Comments, which is  
somewhat line-oriented, so getting each line in turn would be helpful.
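
To show what "getting each line in turn" might mean once the comment
text is available, here is a small Java sketch. It assumes the whole
block comment, delimiters included, arrives as one string, and that each
line may begin with a decorative '*' or '+'. Both are assumptions made
for illustration, not anything ANTLR provides, and '///' comments are
not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; input conventions are assumed, see above. */
final class DocCommentLines {
    private DocCommentLines() {}

    /** Splits a block doc comment into lines, stripping the delimiters and leading decoration. */
    static List<String> lines(String comment) {
        boolean star = comment.startsWith("/**") && comment.endsWith("*/");
        boolean plus = comment.startsWith("/++") && comment.endsWith("+/");
        // The length check rejects "/**/", where opener and closer overlap.
        if (!(star || plus) || comment.length() < 5) {
            throw new IllegalArgumentException("not a block doc comment: " + comment);
        }
        char decoration = star ? '*' : '+';
        // Drop the three-character opener and the two-character closer.
        String body = comment.substring(3, comment.length() - 2);
        List<String> result = new ArrayList<>();
        for (String raw : body.split("\r?\n", -1)) {   // -1 keeps trailing empty lines
            String line = raw.trim();
            if (!line.isEmpty() && line.charAt(0) == decoration) {
                line = line.substring(1).trim();       // strip one leading '*' or '+'
            }
            result.add(line);
        }
        return result;
    }
}
```

Blank lines are kept as empty strings, since paragraph breaks are likely
to matter when parsing the structure of the comment.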

Question 1: How would I write a grammar to accommodate this need?

-------------

Question 2: How can I write a grammar that essentially skips a function
body? In D you can both declare and define functions, just like in C:

int foo(char x, int, long y);

or

int bar(char x, int, long y)
{
}

For my purposes, I don't care what happens inside the {}, but since  
braces can nest arbitrarily deeply, I need to parse through it  
properly. I'm having trouble understanding how to avoid the left  
recursion that makes ANTLR choke. In any case, I suspect this grammar  
will look just like the grammar for the nesting comments above,  
except that I can throw out anything inside the body.
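
The behavior I'm after can be sketched by hand in Java (again, not an
ANTLR grammar) with the same kind of depth counter as for nesting
comments. The names are hypothetical, and the sketch deliberately
ignores braces that appear inside string literals, character literals,
or comments, which a real implementation would have to handle.

```java
/** Hypothetical helper for illustration; ignores braces in strings and comments. */
final class BraceSkipper {
    private BraceSkipper() {}

    /**
     * Returns the index just past the '}' that matches the '{' at
     * {@code open}, or -1 if {@code open} is not a '{' or the block is
     * never closed.
     */
    static int endOfBlock(String src, int open) {
        if (open < 0 || open >= src.length() || src.charAt(open) != '{') {
            return -1;
        }
        int depth = 0;
        for (int i = open; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '{') {
                depth++;              // entered a nested block
            } else if (c == '}') {
                depth--;              // left a block
                if (depth == 0) {
                    return i + 1;     // matching brace of the function body
                }
            }
            // Everything else inside the body is thrown away.
        }
        return -1;                    // unbalanced input
    }
}
```

Given the text of a definition such as bar above, calling
endOfBlock(src, src.indexOf('{')) should return the position right after
the closing brace of the body, however deeply the braces inside it nest.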

I'd really appreciate any help anyone can give. Thank you!

-- 
Rick