[antlr-interest] Parsing documentation comments (with nesting!) (v3)

Thu Feb 22 00:49:46 PST 2007

Hi,

On Wed, Feb 21, 2007 at 07:36:59PM -0800, Rick Mann wrote:
> I've been working on a tool to create a symbol database for the D  
> programming language. This means that I don't need a complete parser,  
> just enough of one to identify a few "global" symbol definitions. I'm  
> doing okay with some language basics, but I'm running into trouble  
> parsing comments. I have a couple of big questions.
> 
> If you're unfamiliar, D is a programming language that looks a lot  
> like C++ and Java. In particular, it has multiline comments delimited  
> by '/*' and '*/'. It has "to-EOL" comments that start with '//' and  
> go to the end of the line.
> 
> It also has nesting multiline comments. You can delimit a comment  
> with '/+' and '+/', and nest these arbitrarily deeply.

Have you seen the 'island' grammar in the v3 examples?
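
Separately from that example, the nesting rule for '/+' ... '+/' can be
sketched by hand with a depth counter. This is plain Java rather than
ANTLR grammar, and the class and method names (NestedComment,
endOfNestedComment) are invented for illustration only:

```java
class NestedComment {
    /**
     * Returns the index just past the '+/' that closes the nesting
     * comment opened at 'start', or -1 if 'start' does not point at
     * '/+' or the comment is never closed.
     */
    static int endOfNestedComment(String src, int start) {
        if (!src.startsWith("/+", start)) {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i + 1 < src.length()) {
            if (src.startsWith("/+", i)) {
                depth++;            // one level deeper
                i += 2;
            } else if (src.startsWith("+/", i)) {
                depth--;            // one level closed
                i += 2;
                if (depth == 0) {
                    return i;       // outermost comment is finished
                }
            } else {
                i++;                // ordinary comment text
            }
        }
        return -1;                  // ran out of input while still nested
    }

    public static void main(String[] args) {
        String src = "/+ a /+ b +/ c +/ int x;";
        int end = endOfNestedComment(src, 0);
        System.out.println("[" + src.substring(0, end) + "]");
    }
}
```

The same counting idea applies to any pair of nesting delimiters.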

> A variant of each of these three denotes a Documentation Comment. If  
> a comment starts with '/**', '/++' or '///', it is considered  
> documentation, and applies to the symbols defined "nearby" (the  
> specific rules are not important). The comment itself has a structure  
> that would be nice to include in the overall grammar.
> 
> At the most basic level, I'd like to be able to get at the content of  
> a regular multiline comment. The beta book shows an example like this:
> 
> COMMENT
>     :    '/*' ( options {greedy=false;} : . )* '*/'
>     ;
> 
> I've tried this, and it works fine, but I can't get at the text of  
> the comment. I tried labeling the subrule, but it didn't like that.  
> So I tried this:
> 
> COMMENT
>     :    '/*'! COMMENTTEXT '*/'!
>          { System.out.println("Found a comment [" + $COMMENTTEXT.text + "]"); }
>     ;
> 
> fragment
> COMMENTTEXT
> options
> {
>     greedy = false;
> }
>     :    .*
>     ;
> 
> But I get "The following alternatives are unreachable: 1".
> 
> (Keep in mind, my grammar will eventually generate an AST, but right  
> now has code to help me debug and learn).
> 
> I'd like to parse the structure of the Doc Comments, which is  
> somewhat line-oriented, so getting each line in turn would be helpful.
> 
> Question 1: How would I write a grammar to accommodate this need?

FWIW, I leave the comment in its 'natural' form and strip the
start/end markers when I process the AST.  Doing it earlier is reasonable
too, though: try calling setText() in a COMMENT lexer action and doing
the obvious substring() work:

  http://www.antlr.org/wiki/pages/viewpage.action?pageId=1461
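
A minimal sketch of that substring() work in plain Java, assuming the
lexer rule has matched the whole comment including its delimiters. The
names CommentText and stripDelimiters are made up for illustration. In an
ANTLR v3 lexer action the call would presumably be something like
setText(stripDelimiters(getText(), "/*", "*/")), but that part is
untested against this grammar:

```java
class CommentText {
    /**
     * Removes the opening and closing delimiters from a matched comment.
     * If the text is not wrapped in both delimiters it is returned as-is.
     */
    static String stripDelimiters(String comment, String open, String close) {
        boolean longEnough = comment.length() >= open.length() + close.length();
        if (longEnough && comment.startsWith(open) && comment.endsWith(close)) {
            return comment.substring(open.length(),
                                     comment.length() - close.length());
        }
        return comment;
    }

    public static void main(String[] args) {
        // Brackets make the surviving leading/trailing whitespace visible.
        System.out.println("[" + stripDelimiters("/* hello */", "/*", "*/") + "]");
    }
}
```

For a '/**' documentation comment, pass "/**" as the opening delimiter
instead; passing "/*" leaves the second '*' in the text.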

> Question 2: How can I write grammar to essentially skip a function  
> body? In D you can both declare and define functions, just like in C:
> 
> int foo(char x, int, long y);
> 
> or
> 
> int bar(char x, int, long y)
> {
> }
> 
> For my purposes, I don't care what happens inside the {}, but since  
> braces can nest arbitrarily deeply, I need to parse through it  
> properly. I'm having trouble understanding how to avoid the left  
> recursion that makes ANTLR choke. In any case, I suspect this grammar  
> will look just like the grammar for the nesting comments above,  
> except that I can throw out anything inside the body.

How about a construction like,

  function_body_skip
      :    LBRACE
           (any_token_but_brace | function_body_skip)*
           RBRACE
      ;

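where 'any_token_but_brace' is a parser rule that matches any token
allowed in function bodies other than LBRACE and RBRACE.  The rule
always consumes an LBRACE before it recurses, so this is not left
recursion.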
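
To show what that rule does at run time, here is a rough hand-written
equivalent in Java: a recursive method over a list of token strings. It
is only a sketch. BodySkipper and skipBody are hypothetical names, and
real ANTLR tokens are objects with types rather than bare strings:

```java
import java.util.Arrays;
import java.util.List;

class BodySkipper {
    /**
     * Mirrors: LBRACE (any_token_but_brace | function_body_skip)* RBRACE
     * 'pos' must point at "{". Returns the index just past the matching "}".
     */
    static int skipBody(List<String> tokens, int pos) {
        if (pos >= tokens.size() || !tokens.get(pos).equals("{")) {
            throw new IllegalArgumentException("expected '{' at " + pos);
        }
        pos++;                                   // consume LBRACE
        while (pos < tokens.size() && !tokens.get(pos).equals("}")) {
            if (tokens.get(pos).equals("{")) {
                pos = skipBody(tokens, pos);     // nested block: recurse
            } else {
                pos++;                           // any_token_but_brace
            }
        }
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("unbalanced braces");
        }
        return pos + 1;                          // consume RBRACE
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("{", "if", "{", "x", "}", "}", "int");
        int next = skipBody(tokens, 0);
        System.out.println("parsing resumes at: " + tokens.get(next));
    }
}
```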

ta,
dave

-- 
http://david.holroyd.me.uk/