[antlr-interest] Parsing documentation comments (with nesting!) (v3)
David Holroyd
dave at badgers-in-foil.co.uk
Thu Feb 22 00:49:46 PST 2007
Hi,
On Wed, Feb 21, 2007 at 07:36:59PM -0800, Rick Mann wrote:
> I've been working on a tool to create a symbol database for the D
> programming language. This means that I don't need a complete parser,
> just enough of one to identify a few "global" symbol definitions. I'm
> doing okay with some language basics, but I'm running into trouble
> parsing comments. I have a couple of big questions.
>
> If you're unfamiliar, D is a programming language that looks a lot
> like C++ and Java. In particular, it has multiline comments delimited
> by '/*' and '*/'. It has "to-EOL" comments that start with '//' and
> go to the end of the line.
>
> It also has nesting multiline comments. You can delimit a comment
> with '/+' and '+/', and nest these arbitrarily deeply.
Have you seen the 'island' grammar in the v3 examples?
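For the nesting '/+ ... +/' comments specifically, the underlying idea is just a depth counter. The following is a minimal hand-written Java sketch of that idea, not the island grammar from the examples and not ANTLR-generated code; the class and method names are made up for illustration.

```java
// Hypothetical helper (not part of ANTLR): finds the end of a nesting D comment.
final class NestedCommentScanner {
    /**
     * Expects "/+" at index start. Returns the index just past the matching
     * "+/", or -1 if the comment is never closed.
     */
    static int endOfNestedComment(String src, int start) {
        if (!src.startsWith("/+", start)) {
            throw new IllegalArgumentException("expected '/+' at index " + start);
        }
        int depth = 0;
        int i = start;
        while (i < src.length()) {
            if (src.startsWith("/+", i)) {
                depth++;            // entering one more nesting level
                i += 2;
            } else if (src.startsWith("+/", i)) {
                depth--;            // leaving one nesting level
                i += 2;
                if (depth == 0) {
                    return i;       // outermost comment is closed
                }
            } else {
                i++;                // ordinary comment character
            }
        }
        return -1;                  // ran out of input while still nested
    }
}
```

Calling endOfNestedComment(src, 0) on a string that starts with a nested comment gives the offset where ordinary source text resumes.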
> A variant of each of these three denotes a Documentation Comment. If
> a comment starts with '/**', '/++' or '///', it is considered
> documentation, and applies to the symbols defined "nearby" (the
> specific rules are not important). The comment itself has a structure
> that would be nice to include in the overall grammar.
>
> At the most basic level, I'd like to be able to get at the content of
> a regular multiline comment. The beta book shows an example like this:
>
> COMMENT
> : '/*' ( options {greedy=false;} : . )* '*/'
> ;
>
> I've tried this, and it works fine, but I can't get at the text of
> the comment. I tried labeling the subrule, but it didn't like that.
> So I tried this:
>
> COMMENT
> : '/*'! COMMENTTEXT '*/'!
> { System.out.println("Found a comment [" + $COMMENTTEXT.text + "]"); }
> ;
>
> fragment
> COMMENTTEXT
> options
> {
> greedy = false;
> }
> : .*
> ;
>
> But I get "The following alternatives are unreachable: 1".
>
> (Keep in mind, my grammar will eventually generate an AST, but right
> now has code to help me debug and learn).
>
> I'd like to parse the structure of the Doc Comments, which is
> somewhat line-oriented, so getting each line in turn would be helpful.
>
> Question 1: How would I write a grammar to accommodate this need?
FWIW, I leave the comment in its 'natural' form, and strip the
start/end markers when I process the AST. Doing it earlier is reasonable
too though -- try using setText() in a COMMENT lexer action and doing
the obvious substring() work:
http://www.antlr.org/wiki/pages/viewpage.action?pageId=1461
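As a rough sketch of the substring() work, assuming the Java target: the helper below is hypothetical (the class and method names are mine, not ANTLR's) and simply drops the opening and closing markers for the comment forms described in the question.

```java
// Hypothetical helper (not part of ANTLR): strips the delimiters from a D comment.
final class CommentMarkers {
    /** Returns the comment body without its opening and closing markers. */
    static String strip(String comment) {
        // Line comments: "///" (doc) or "//"; there is no closing marker.
        if (comment.startsWith("//")) {
            int start = comment.startsWith("///") ? 3 : 2;
            return comment.substring(start);
        }
        // Block comments: "/**", "/*", "/++" or "/+", closed by "*/" or "+/".
        int start = (comment.startsWith("/**") || comment.startsWith("/++")) ? 3 : 2;
        int end = comment.length() - 2;
        // Guard against degenerate input such as "/**/", where the markers overlap.
        if (end < start) {
            return "";
        }
        return comment.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("/** Returns the sum. */") + "]");
    }
}
```

Inside the COMMENT rule's action this would be used as something like setText(CommentMarkers.strip(getText())), or the same logic applied later to the token text while walking the AST.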
> Question 2: How can I write grammar to essentially skip a function
> body? In D you can both declare and define functions, just like in C:
>
> int foo(char x, int, long y);
>
> or
>
> int bar(char x, int, long y)
> {
> }
>
> For my purposes, I don't care what happens inside the {}, but since
> braces can nest arbitrarily deeply, I need to parse through it
> properly. I'm having trouble understanding how to avoid the left
> recursion that makes ANTLR choke. In any case, I suspect this grammar
> will look just like the grammar for the nesting comments above,
> except that I can throw out anything inside the body.
How about a construction like,
function_body_skip
    : LBRACE
      (any_token_but_brace | function_body_skip)*
      RBRACE
    ;
Where 'any_token_but_brace' is a parser rule that will match any token
allowed in function bodies other than LBRACE and RBRACE.
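To show why this does not run into left recursion: the rule always consumes an LBRACE before it can call itself. Here is a hand-written Java sketch of the same recursion over a plain list of token strings. It is only an illustration, not what ANTLR generates, and the names are invented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the function_body_skip rule, written by hand.
final class BodySkipper {
    /**
     * Expects "{" at tokens.get(pos). Returns the index just past the
     * matching "}", discarding everything in between.
     */
    static int skipBody(List<String> tokens, int pos) {
        if (pos >= tokens.size() || !tokens.get(pos).equals("{")) {
            throw new IllegalArgumentException("expected '{' at index " + pos);
        }
        pos++; // consume LBRACE
        while (pos < tokens.size() && !tokens.get(pos).equals("}")) {
            if (tokens.get(pos).equals("{")) {
                pos = skipBody(tokens, pos); // nested block: recurse
            } else {
                pos++; // any_token_but_brace: discard it
            }
        }
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("unbalanced braces: missing '}'");
        }
        return pos + 1; // consume RBRACE
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "int", "bar", "(", ")", "{", "if", "{", "x", "}", "}", ";");
        int next = skipBody(tokens, 4);
        System.out.println("next token after body: " + tokens.get(next));
    }
}
```

The recursive call only happens after at least one token has been consumed, which is the property that keeps the grammar rule from being left-recursive.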
ta,
dave
--
http://david.holroyd.me.uk/