[antlr-interest] Interface between a C preprocessor and the C grammar parsers

Wed Mar 18 11:20:19 PDT 2009

Vincent De Groote wrote:
> Hello,
> 
> I need to build a program which reads C language files, modifies the 
> files (code transformations), then saves the files back to disk. The 
> saved files must have the original file structure, with unexpanded 
> include files, unexpanded macros, inactive lines (skipped by the 
> preprocessor), ...  Besides the code rewrite functionality, the program 
> must also be able to reformat source code, based on its syntactic structure.
> 
> This means that the tokens hidden from the grammatical parser must be 
> accessible to the final application.
> 
> I'm really a newbie in parsing, and I need some advice on how to do this.
> 
> My first questions are about the interface between the preprocessor and 
> the C grammar parser:
> 
> - Should the preprocessor parser be embedded in the C grammar ?  (This 
> seems a little ugly)

Yes. I embed the following in my C grammar:
ppline
    : ppdefine
    | ppinclude
    | ppif
    ;

ppdefine
    : PPdefine^ ID ( ppdefineArgs )? (
        constExpr
        | ID
        | Number
        | StringLiteral
        |   // empty
    )
    ;

ppdefineArgs
    : LPAREN! ( ID ( COMMA! ID )* )? RPAREN!
                { ## = #( #[NPPdefineArgs], ## );}
    ;

ppinclude
    : ( PPinclude^ LT ) => PPinclude^ LT! ( ~(GT) )+ GT!
    | PPinclude^ StringLiteral
    ;

ppif
    // need to do lookahead because "#if", "#ifdef", and "#ifndef" are the same
    // for k==3
    : ( ( PPifndef ) => PPifndef^ | ( PPifdef ) => PPifdef^ | ( PPif ) => PPif^ )
            ppexpr ( ppline )+ ( ppelif )* ( ppelse )? PPendif!
    ;

/*
ppifdef
    // we could use "ID" instead of ppexpr below, but this way it's easier
    // to work with the AST
    : PPifdef^ ppexpr ( ppline )+ ( ppelif )* ( ppelse )? PPendif!
    ;
*/

ppelif
    : PPelif! ppexpr ( ppline )*
                { ## = #( #[NPPelif], ## );}
    ;

ppelse
    : PPelse! ( ppline )*
                { ## = #( #[NPPelse], ## );}
    ;

ppexpr
    : ppAndExpr ( LOR^ ppAndExpr )*
                { ## = #( #[NPPexpr], ## );}
    ;

ppAndExpr
    : ppNotExpr ( LAND^ ppNotExpr )*
    ;

ppNotExpr
    : ( LNOT^ )? ppExprTerminal
    ;

ppExprTerminal
    : ( PPDEFINED ) => PPDEFINED^ LPAREN! ID RPAREN!
    | ID
    | Number
    ;

> - Should the preprocessor parser be a syntactic parser (with 
> productions like active/inactive lines, start and end of includes, ...), 
> or a lexical parser ?

People generally call them a "lexer" and a "parser", and ANTLR generates
both from a single grammar. The rules above are parser rules. You'll also
need to add these lexer rules:
PPDEFINED   : "defined" ;
PPdefine    : "#define" ;
PPif        : "#if" ;
PPelse      : "#else" ;
PPelif      : "#elif" ;
PPendif     : "#endif" ;
PPinclude   : "#include" ;
PPifdef     : "#ifdef" ;
PPifndef    : "#ifndef" ;

and these "imaginary tokens":
        |       NPPblock
        |       NPPline
        |       NPPelif
        |       NPPelse
        |       NPPdefineArgs
        |       NPPexpr
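
To make the lexer's job concrete, here is a rough sketch in Python (not ANTLR) of recognizing the directive tokens. It is only an illustration: the regular expression, the tokenize() helper and the catch-all OTHER group are assumptions of this sketch, and only the token names come from the rules above. It also shows why "#if", "#ifdef" and "#ifndef" need care: they share a prefix, so the longer spellings have to be tried first.

```python
import re

# Longer directive names come first so "#ifndef" is not cut short at "#if".
DIRECTIVES = ["#ifndef", "#ifdef", "#if", "#elif", "#else", "#endif",
              "#define", "#include"]

TOKEN_RE = re.compile(
    r"(?P<PP>" + "|".join(re.escape(d) for d in DIRECTIVES) + r")"
    r"|(?P<ID>[A-Za-z_]\w*)"
    r"|(?P<Number>\d+)"
    r"|(?P<WS>\s+)"
    r"|(?P<OTHER>.)"
)

def tokenize(text):
    """Return (type, text) pairs, naming directive tokens like the lexer rules."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "WS":
            continue                      # whitespace is simply dropped here
        if kind == "PP":
            kind = "PP" + value[1:]       # "#ifdef" becomes "PPifdef"
        elif kind == "ID" and value == "defined":
            kind = "PPDEFINED"
        tokens.append((kind, value))
    return tokens

tokens = tokenize("#ifndef FOO\n#define FOO 1\n#endif")
```

Listing "#if" ahead of "#ifdef" in DIRECTIVES would split "#ifdef" into "#if" followed by the identifier "def", which is the same ambiguity the lookahead in ppif works around.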

> -  What should this preprocessor parser return ? 
>    - A list of tokens (with their channel set to hidden / visible) (is 
> it possible for a grammatical parser to return a token list) ?
>    - A tree structure with the structure of the file ?
>    - Something else ?

If you just enhance the C parser to handle preprocessor directives like this,
the preprocessor stuff will just show up as nodes in your C parser's AST.
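
As a rough picture of what that looks like, here is a Python sketch (not ANTLR's AST class) of a tree in which directive nodes sit next to ordinary C nodes. The Node class, the "TranslationUnit" and "Declaration" type names and the preprocessor_nodes() helper are made up for this sketch; the PP... and NPP... type names follow the rules above.

```python
class Node:
    """Minimal stand-in for an AST node: a token type, its text, and children."""
    def __init__(self, kind, text="", children=()):
        self.kind = kind
        self.text = text
        self.children = list(children)

def preprocessor_nodes(node, depth=0):
    """Yield (depth, kind, text) for each node whose type marks a directive."""
    if node.kind.startswith(("PP", "NPP")):
        yield depth, node.kind, node.text
    for child in node.children:
        yield from preprocessor_nodes(child, depth + 1)

# Hand-built tree for:
#   #ifdef DEBUG
#   #define LOG 1
#   #else
#   #define LOG 0
#   #endif
#   ... followed by an ordinary declaration
tree = Node("TranslationUnit", children=[
    Node("PPifdef", "#ifdef", [
        Node("NPPexpr", children=[Node("ID", "DEBUG")]),
        Node("PPdefine", "#define", [Node("ID", "LOG"), Node("Number", "1")]),
        Node("NPPelse", children=[
            Node("PPdefine", "#define", [Node("ID", "LOG"), Node("Number", "0")]),
        ]),
    ]),
    Node("Declaration", children=[Node("ID", "counter")]),
])

found = [(depth, kind) for depth, kind, _ in preprocessor_nodes(tree)]
```

The application then picks the directive nodes out of the same tree it already walks for the C code, without a separate preprocessor pass.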

> 
> Other questions about the C grammar parser:
> 
> In the reference book (The Definitive ANTLR Reference: Building Domain 
> Specific languages), I read that an AST should not contain syntax-only 
> tokens, like the ';' statement separator, parentheses used to change 
> operation precedence ...  I do not understand why  an AST should not 
> contain such tokens.  I suppose they are just useless in an AST.  Are 
> there other reasons ?

No, just that they're useless. Note that the "cgram" ANTLR grammar
does put a lot of these useless tokens in the AST. I usually find it
easier to just ignore them when processing the AST, rather than futzing with
the grammar to stop them from going into the AST.
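
A minimal sketch of the "just ignore them" approach, in Python rather than ANTLR's tree API. The tuple layout, the SYNTAX_ONLY set and the node type names are assumptions made for this illustration, not names taken from "cgram".

```python
# Token types that carry no meaning once the tree shape is known.
SYNTAX_ONLY = {"SEMI", "LPAREN", "RPAREN", "LCURLY", "RCURLY", "COMMA"}

def walk(node):
    """Preorder walk over (kind, text, children) tuples, skipping syntax-only tokens."""
    kind, text, children = node
    if kind not in SYNTAX_ONLY:
        yield kind, text
    for child in children:
        yield from walk(child)

# Made-up AST for "f(a, b);" with the punctuation left in the tree.
call = ("NFunctionCall", "f", [
    ("LPAREN", "(", []),
    ("ID", "a", []),
    ("COMMA", ",", []),
    ("ID", "b", []),
    ("RPAREN", ")", []),
    ("SEMI", ";", []),
])

visited = list(walk(call))
```

The filter lives in one place in the walker, so the grammar stays untouched.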

> 
> This book is well written, but I'm not sure I'll be able to select the 
> best choice between AST, Tree, custom made structures ...

Start by using AST, then if/when you start needing features that AST doesn't have,
subclass it.

> 
> If the AST is not the right structure to return the parsed grammar to the 
> caller, I suppose I could use custom made structures.  But is that the 
> best choice ?

AST certainly has the basic tree operations that you'll need, so don't
start from scratch. Subclass AST as needed.
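
The subclassing idea, sketched in Python under stated assumptions: the AST class below is a toy stand-in, not ANTLR's, and SourceAST with its line and hidden_before fields is hypothetical. It shows one way a subclass could carry the extra information (line number, preceding comments and whitespace) that a source-preserving rewriter needs.

```python
class AST:
    """Toy base node offering only the basic tree operations."""
    def __init__(self, kind, text=""):
        self.kind = kind
        self.text = text
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return self

    def to_string_tree(self):
        """Render the subtree in a LISP-like form, e.g. "( = x 1 )"."""
        if not self.children:
            return self.text
        inner = " ".join(child.to_string_tree() for child in self.children)
        return "( " + self.text + " " + inner + " )"

class SourceAST(AST):
    """Hypothetical subclass that remembers where a node came from."""
    def __init__(self, kind, text="", line=0, hidden_before=""):
        super().__init__(kind, text)
        self.line = line                    # source line of the token
        self.hidden_before = hidden_before  # comments/whitespace before it

assign = SourceAST("ASSIGN", "=", line=3, hidden_before="/* counter */ ")
assign.add_child(SourceAST("ID", "x", line=3))
assign.add_child(SourceAST("Number", "1", line=3))
rendered = assign.to_string_tree()
```

Code written against the base class keeps working, and only the parts that care about layout look at the extra fields.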
> 
> I do not understand very well the differences between an abstract tree 
> and a concrete tree (I'm really a newbie ...). 
> Some hints about these differences are welcome.

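The AST is the essence of the parsed source code, stored in a tree-like
data structure. A concrete tree (parse tree) records every rule the parser
matched and every token it consumed, punctuation included. Don't worry about
a concrete tree; you won't be using it.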
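
To show the difference on a small case, here is a Python sketch that starts from a hand-written concrete tree for "x = (a + b);" and boils it down to an AST. The rule names, the token type names and the to_ast() helper are invented for this illustration, and the helper assumes every rule with several surviving children contains exactly one operator token.

```python
PUNCTUATION = {"SEMI", "LPAREN", "RPAREN"}
OPERATORS = {"ASSIGN", "PLUS"}

def to_ast(node):
    """Reduce a concrete tree to an AST of nested (operator, operands...) tuples."""
    kind, payload = node
    if isinstance(payload, str):             # token leaf: (TYPE, "text")
        return None if kind in PUNCTUATION else payload
    kept = [(child, to_ast(child)) for child in payload]
    kept = [(child, ast) for child, ast in kept if ast is not None]
    if len(kept) == 1:                       # rule that only wraps one thing
        return kept[0][1]
    ops = [ast for child, ast in kept if child[0] in OPERATORS]
    operands = [ast for child, ast in kept if child[0] not in OPERATORS]
    return (ops[0], *operands)               # operator becomes the subtree root

# Concrete tree: one node per rule matched, one leaf per token consumed.
concrete = ("statement", [
    ("expr", [
        ("assignment", [
            ("ID", "x"),
            ("ASSIGN", "="),
            ("primary", [
                ("LPAREN", "("),
                ("additive", [("ID", "a"), ("PLUS", "+"), ("ID", "b")]),
                ("RPAREN", ")"),
            ]),
        ]),
    ]),
    ("SEMI", ";"),
])

ast = to_ast(concrete)
```

The parentheses and the semicolon are gone from the result, yet the grouping they expressed survives in the tree's shape.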
> 
> 
> Thanks for your replies,
> 
> Vincent De Groote

Your task is far harder than it looks.
I'm currently working on a thing that just adds "printf" calls 
after every assignment in C code. It's quite amazing how 
difficult it is, especially after seeing how easy it is
for "C-" in the book. 

The outputting of formatted code will be the easy part.
You can either use a treewalker or my "by hand" approach:
http://www.jazillian.com/antlr/emitter.html
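
For a feel of the tree-walking approach, here is a small Python sketch of an emitter. It is not the code behind the link above: the node shapes ("if", "block", "assign"), the four-space indent and the emit()/expr() helpers are all assumptions of this sketch.

```python
def expr(node):
    """Render an expression: either a name/number string or (op, left, right)."""
    if isinstance(node, str):
        return node
    op, left, right = node
    return expr(left) + " " + op + " " + expr(right)

def emit(node, indent=0):
    """Return the formatted source lines for a statement node."""
    pad = "    " * indent
    kind = node[0]
    if kind == "block":
        lines = [pad + "{"]
        for statement in node[1]:
            lines += emit(statement, indent + 1)
        lines.append(pad + "}")
        return lines
    if kind == "if":
        _, condition, body = node
        return [pad + "if (" + expr(condition) + ")"] + emit(body, indent)
    if kind == "assign":
        _, target, value = node
        return [pad + target + " = " + expr(value) + ";"]
    raise ValueError("unknown node kind: " + kind)

tree = ("if", ("<", "x", "10"),
        ("block", [("assign", "x", ("+", "x", "1"))]))
lines = emit(tree)
```

Each node kind decides its own layout, and the indent level is passed down the recursion.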

Good luck. 
Andy