[antlr-interest] Preserve source code comments

Sat Oct 27 15:29:29 PDT 2012

This is what I did, in Python, because Python allowed me to:

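def extract_comments(tokens):  # tokens come from parser.input.tokens
    ''' Extract comments from a token list by looking at their channel.
        Returns a list of (line, text) tuples. COMMENT_CHANNEL and
        EOL_COMMENT_CHANNEL are the channel numbers set in the lexer grammar.
    '''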
    def _extract():
        for t in tokens:
            if t.getChannel() in (COMMENT_CHANNEL, EOL_COMMENT_CHANNEL):
                comment_text = t.getText()[2:].rstrip()
                if comment_text:
                    yield (t.getLine(), comment_text)
    return list(_extract())

def assign_comments(tree, comments):
    ''' Match comments to tree nodes. *tree* is an AST as returned by
        ANTLR, and *comments* is a list of tuples (line, comment).
    '''
    # Attach to each tree node all comments on or above its line
    # that haven't yet been matched

    # flatten a tree, depth-first
    def flatten(t):
        result = [(t.getLine(), t)]
        if t.children:
            for c in t.children:
                result += flatten(c)
        return result

    nodes = flatten(tree)
    # sort comments, just in case
    comments = sorted(comments)
    cindex = 0
    for lineno, node in nodes:
        node.comments = []
        while cindex < len(comments) and comments[cindex][0] <= lineno:
            node.comments.append(comments[cindex][1])
            cindex += 1

I only had to set the right channels in the lexer grammar. No new tree
adapters, token, or tree types (I guess it's one of the advantages of
working with a language with duck-typing).
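
To try the matching without ANTLR, here is a minimal sketch that feeds
hand-made objects through both steps. FakeToken, FakeNode, and the channel
numbers 2 and 3 are assumptions invented for this example; they are not part
of ANTLR or of the grammar used above. The two functions are restated in
condensed form only so the block runs on its own.

```python
# Hypothetical channel numbers; the real ones are whatever the lexer sets.
COMMENT_CHANNEL, EOL_COMMENT_CHANNEL = 2, 3


class FakeToken:
    """Stand-in exposing only the token methods the code relies on."""
    def __init__(self, line, text, channel=0):
        self.line, self.text, self.channel = line, text, channel

    def getLine(self): return self.line
    def getText(self): return self.text
    def getChannel(self): return self.channel


class FakeNode:
    """Stand-in exposing getLine() and children, like an AST node."""
    def __init__(self, line, children=None):
        self.line, self.children = line, children

    def getLine(self): return self.line


def extract_comments(tokens):
    result = []
    for t in tokens:
        if t.getChannel() in (COMMENT_CHANNEL, EOL_COMMENT_CHANNEL):
            text = t.getText()[2:].rstrip()  # drop the 2-char opener
            if text:
                result.append((t.getLine(), text))
    return result


def flatten(t):
    # depth-first list of (line, node)
    result = [(t.getLine(), t)]
    for c in t.children or []:
        result += flatten(c)
    return result


def assign_comments(tree, comments):
    comments = sorted(comments)
    cindex = 0
    for lineno, node in flatten(tree):
        node.comments = []  # attribute added on the fly (duck typing)
        while cindex < len(comments) and comments[cindex][0] <= lineno:
            node.comments.append(comments[cindex][1])
            cindex += 1


tokens = [
    FakeToken(1, "// set up the counter  ", COMMENT_CHANNEL),
    FakeToken(2, "x"),
    FakeToken(2, "// starts at zero", EOL_COMMENT_CHANNEL),
    FakeToken(3, "//", COMMENT_CHANNEL),  # empty comment, skipped
    FakeToken(4, "y"),
]
first, second = FakeNode(2), FakeNode(4)
root = FakeNode(2, [first, second])

comments = extract_comments(tokens)
assign_comments(root, comments)
print(root.comments, first.comments, second.comments)
```

Both comments end up on root, because it is the first node in depth-first
order whose line is at or below theirs; the bare // on line 3 is dropped as
empty.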

I used this approach because the heuristics I saw and thought about did not
require that comments be associated with AST nodes during parsing, and
doing it during parsing seemed quite complicated.

In fact, some quite complex heuristics for associating comments with nodes
can be implemented more easily with this simple post-processing approach.
The heuristic above is that comments above a line with tokens, or on that
line, belong to the first node on it (first in depth-first order). It could
be improved to honor EOL comments, but I'm using EOL comments for something
else.
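
One possible reading of "honoring EOL comments" is to attach each EOL
comment to the last node on its own line instead of the first. The sketch
below assumes that reading; it is not what the code above does. Node and
assign_eol_comments are hypothetical names, and "last" means last in
depth-first order, which may not be the rightmost token on the line.

```python
class Node:
    """Hypothetical stand-in for an AST node (getLine() and children)."""
    def __init__(self, line, name, children=None):
        self.line, self.name = line, name
        self.children = children or []
        self.comments = []

    def getLine(self): return self.line


def flatten(t):
    # depth-first list of nodes
    result = [t]
    for c in t.children:
        result += flatten(c)
    return result


def assign_eol_comments(tree, eol_comments):
    """Attach each (line, text) EOL comment to the last node on its line.

    Returns the comments whose line has no node, so the caller can fall
    back to another heuristic for them.
    """
    last_on_line = {}
    for node in flatten(tree):
        last_on_line[node.getLine()] = node  # later nodes overwrite earlier
    unmatched = []
    for line, text in sorted(eol_comments):
        node = last_on_line.get(line)
        if node is None:
            unmatched.append((line, text))
        else:
            node.comments.append(text)
    return unmatched


lhs, rhs = Node(1, "x"), Node(1, "1")
assign = Node(1, "=", [lhs, rhs])
call = Node(3, "print")
root = Node(1, "block", [assign, call])

left_over = assign_eol_comments(root, [(1, "initial value"), (2, "orphan")])
print(rhs.comments, left_over)
```

Comments returned as unmatched could then be passed to assign_comments, so
the two heuristics would run one after the other.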

-- Juanca

On Thu, Oct 18, 2012 at 12:37 PM, Juancarlo Añez <apalala at gmail.com> wrote:

> I know this question has been asked before, but the threads about it are
> old and inconclusive.
>
> I need to associate source code comments with the nearest parsed token.
> Everything else I need to do I can do post-parsing.
>
> Could I get away with using a custom token type that grabs the nearest
> comment on its constructor?
>
> All I need is a basic recipe.
>
> TIA,
>
> --
> Juancarlo *Añez*
>

-- 
Juancarlo *Añez*