[antlr-interest] Pre-processor advice [C target]

Thu Aug 30 13:27:15 PDT 2012

Hello all,

We have a DSL at my company, for which we have our own compiler written
in C/C++. It is very old, monstrous, and terribly written. A little over
a year ago, I successfully replaced the lexer and parser with an ANTLR
implementation, and now I am tasked with replacing the preprocessor. I
am writing to ask for some general advice on the best approach for this.

Currently, we read the source file from disk into a memory buffer. The
preprocessor works on this buffer, doing text transformations as
necessary. The resulting string is then passed into
antlr3StringStreamNew(), and the ANTLR lexer and parser take over from
there, ultimately executing the semantic actions that produce our binary
object code. Ideally, the new preprocessor would be a drop-in
replacement in this process.

The set of preprocessor commands is relatively short and fairly
typical:

#include, #define, #undef, #ifdef, #else, #elseif, #endif, #nosubst,
#subst (the last two simply switch #define substitution off and on for
a block of code)
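
For illustration only, here is a minimal sketch in C of how the
conditional commands could be handled on the in-memory buffer while
keeping the output at the same line count as the input (directive lines
and skipped lines become empty lines). It is not our existing code. The
names NameSet, match() and strip_conditionals() are hypothetical, the
sketch assumes directives start in column 0, it records only the macro
name for #define, and it leaves out #include, #elseif, #nosubst and
#subst entirely.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH    32   /* assumed limit on #ifdef nesting */
#define MAX_NAMES    64   /* assumed limit on defined names  */
#define MAX_NAME_LEN 64

/* Hypothetical set of currently defined macro names (bodies omitted). */
typedef struct {
    char   names[MAX_NAMES][MAX_NAME_LEN];
    size_t count;
} NameSet;

static int name_index(const NameSet *set, const char *s, size_t len)
{
    for (size_t i = 0; i < set->count; i++)
        if (strlen(set->names[i]) == len && memcmp(set->names[i], s, len) == 0)
            return (int)i;
    return -1;
}

static void name_add(NameSet *set, const char *s, size_t len)
{
    if (len == 0 || len >= MAX_NAME_LEN || set->count == MAX_NAMES) return;
    if (name_index(set, s, len) >= 0) return;
    memcpy(set->names[set->count], s, len);
    set->names[set->count][len] = '\0';
    set->count++;
}

static void name_remove(NameSet *set, const char *s, size_t len)
{
    int i = name_index(set, s, len);
    if (i < 0) return;
    set->count--;   /* move the last entry into the freed slot */
    memmove(set->names[i], set->names[set->count], MAX_NAME_LEN);
}

/* If the line starts with kw followed by a blank or the line end, return a
 * pointer to its first argument and store the argument length; else NULL. */
static const char *match(const char *line, size_t len, const char *kw,
                         size_t *arglen)
{
    size_t k = strlen(kw);
    if (len < k || memcmp(line, kw, k) != 0) return NULL;
    if (len > k && line[k] != ' ' && line[k] != '\t') return NULL;
    size_t i = k;
    while (i < len && (line[i] == ' ' || line[i] == '\t')) i++;
    size_t j = i;
    while (j < len && line[j] != ' ' && line[j] != '\t') j++;
    *arglen = j - i;
    return line + i;
}

/* Copy src[0..n) to dst (at least n + 1 bytes). Directive lines and lines in
 * inactive blocks are emitted as empty lines so line numbers stay the same.
 * Returns the number of bytes written, or (size_t)-1 if nesting is too deep. */
size_t strip_conditionals(const char *src, size_t n, char *dst, NameSet *defs)
{
    int    active[MAX_DEPTH];   /* active[d]: is the block at depth d emitted? */
    int    depth = 0;
    size_t out = 0, pos = 0;

    active[0] = 1;
    while (pos < n) {
        const char *line = src + pos;
        const char *nl   = memchr(line, '\n', n - pos);
        size_t full = nl ? (size_t)(nl - line) + 1 : n - pos; /* incl. '\n' */
        size_t len  = nl ? full - 1 : full;
        size_t alen = 0;
        const char *arg;
        int keep = 0;

        if (len > 0 && line[len - 1] == '\r') len--;   /* tolerate CRLF */

        if ((arg = match(line, len, "#ifdef", &alen)) != NULL) {
            if (depth + 1 >= MAX_DEPTH) return (size_t)-1;
            active[depth + 1] = active[depth] && name_index(defs, arg, alen) >= 0;
            depth++;
        } else if (match(line, len, "#else", &alen) != NULL) {
            if (depth > 0) active[depth] = active[depth - 1] && !active[depth];
        } else if (match(line, len, "#endif", &alen) != NULL) {
            if (depth > 0) depth--;
        } else if ((arg = match(line, len, "#define", &alen)) != NULL) {
            if (active[depth]) name_add(defs, arg, alen);
        } else if ((arg = match(line, len, "#undef", &alen)) != NULL) {
            if (active[depth]) name_remove(defs, arg, alen);
        } else {
            keep = active[depth];   /* ordinary source line */
        }

        if (keep) { memcpy(dst + out, line, full); out += full; }
        else if (nl) dst[out++] = '\n';
        pos += full;
    }
    dst[out] = '\0';
    return out;
}
```

Blanking lines instead of deleting them keeps line N of the output at
line N of the input for these commands. Anything that changes the line
count, such as #include or a multi-line macro body, would still need a
separate mapping back to the original lines.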

There are a few requirements that complicate this a bit:

1. The original line numbers must be preserved for later stages (for
error messages, and status at runtime), even after multi-line macro
substitutions.

2. The rules for #define substitution are very complex. A macro name
may contain any symbols except white space. The crazy thing, though, is
that when searching the code text for possible substitutions,
non-alphanumeric symbols are treated both as delimiters and as ordinary
name characters. The current algorithm identifies tokens using white
space as a true delimiter, then identifies all possible sub-tokens
based on these partial delimiters. Each candidate sub-token is looked up
in the table of defines, and if there is a match, the text is
substituted. Candidates are tried from largest to smallest, and the
algorithm moves on once a substitution is made or all candidates have
been tried. I suspect that I will still be doing this sub-token parsing
and substitution by hand, since I don't think ANTLR supports overlapping
tokens like these (but I would love to hear if someone has done
something like this).

3. Support for function-like macros (text substitution with arguments)
must be added.
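
To make the substitution rule in item 2 concrete, here is a minimal
sketch in C of one possible reading of it, not the existing
implementation. It assumes that a sub-token may begin or end at the
token edges or next to any non-alphanumeric character, that ties between
candidates of equal length go to the leftmost one, and that at most one
substitution is made per white-space-delimited token. The names Define,
lookup(), find_substitution() and expand_token() are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical define-table entry; a real table would likely be hashed. */
typedef struct {
    const char *name;
    const char *body;
} Define;

/* Linear lookup of the name s[0..len); returns its body or NULL. */
static const char *lookup(const Define *defs, size_t ndefs,
                          const char *s, size_t len)
{
    for (size_t i = 0; i < ndefs; i++)
        if (strlen(defs[i].name) == len && memcmp(defs[i].name, s, len) == 0)
            return defs[i].body;
    return NULL;
}

/* A position is a sub-token boundary at the token edges, or when the
 * character on either side of it is non-alphanumeric. */
static int is_boundary(const char *tok, size_t len, size_t p)
{
    if (p == 0 || p == len) return 1;
    return !isalnum((unsigned char)tok[p - 1]) ||
           !isalnum((unsigned char)tok[p]);
}

/* Try every boundary-to-boundary sub-token, longest first and leftmost
 * first within a length. Returns 1 on the first match, 0 if none. */
static int find_substitution(const Define *defs, size_t ndefs,
                             const char *tok, size_t len,
                             size_t *start, size_t *sublen, const char **body)
{
    for (size_t n = len; n > 0; n--) {
        for (size_t s = 0; s + n <= len; s++) {
            if (!is_boundary(tok, len, s) || !is_boundary(tok, len, s + n))
                continue;
            const char *b = lookup(defs, ndefs, tok + s, n);
            if (b != NULL) {
                *start = s; *sublen = n; *body = b;
                return 1;
            }
        }
    }
    return 0;
}

/* Expand one white-space-delimited token into out (capacity cap, including
 * the terminating NUL). Returns the length written, or (size_t)-1 if the
 * result does not fit. */
size_t expand_token(const Define *defs, size_t ndefs,
                    const char *tok, size_t len, char *out, size_t cap)
{
    size_t start = 0, sublen = 0;
    const char *body = NULL;

    if (!find_substitution(defs, ndefs, tok, len, &start, &sublen, &body)) {
        if (len + 1 > cap) return (size_t)-1;
        memcpy(out, tok, len);          /* no match: copy token unchanged */
        out[len] = '\0';
        return len;
    }

    size_t blen  = strlen(body);
    size_t total = len - sublen + blen;
    if (total + 1 > cap) return (size_t)-1;
    memcpy(out, tok, start);                            /* text before match */
    memcpy(out + start, body, blen);                    /* macro body        */
    memcpy(out + start + blen, tok + start + sublen,    /* text after match  */
           len - start - sublen);
    out[total] = '\0';
    return total;
}
```

With defines for "a+b", "a" and "b->", this sketch turns the token
"x=a+b;" into "x=" followed by the body of "a+b" and then ";", because
the three-character candidate is tried before the single "a". The token
"ab" is left alone, since there is no boundary between two alphanumeric
characters.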

I have spent some time searching the mailing lists and re-reading the
ANTLR book, where I found some hints but no clear-cut solution to my
problems. String templates and TokenRewriteStream look the most
promising, but as far as I can tell TokenRewriteStream has not been
implemented in the C target runtime. Can anyone suggest what options
might be available to me, given these requirements?

Thank you!

- Justin Murray