[antlr-interest] Interface between a C preprocessor and the C grammar parsers
Andy Tripp
antlr at jazillian.com
Wed Mar 18 11:20:19 PDT 2009
Vincent De Groote wrote:
> Hello,
>
> I need to build a program which reads C language files, modifies the
> files (code transformations), then saves the files back to disk. The
> saved files must have the original file structure, with unexpanded
> include files, unexpanded macros, inactive lines (skipped by the
> preprocessor), ... Besides the code rewrite functionality, the program
> must also be able to reformat source code, based on its syntactic structure.
>
> This means that the tokens hidden from the grammatical parser must be
> accessible to the final application.
>
> I'm really a newbie in parsing, and I need some advice on how to do this.
>
> My first questions are about the interface between the preprocessor and
> the C grammar parser:
>
> - Should the preprocessor parser be embedded in the C grammar ? (This
> seems a little ugly)
Yes. I embed the following in my C grammar:
ppline
    : ppdefine
    | ppinclude
    | ppif
    ;

ppdefine
    : PPdefine^ ID ( ppdefineArgs )?
      ( constExpr
      | ID
      | Number
      | StringLiteral
      | // empty
      )
    ;

ppdefineArgs
    : LPAREN! ( ID )? ( COMMA! ID )* RPAREN!
      { ## = #( #[NPPdefineArgs], ## );}
    ;

ppinclude
    : ( PPinclude^ LT ) => PPinclude^ LT! ( ~(GT) )+ GT!
    | PPinclude^ StringLiteral
    ;

ppif
    // need to do lookahead because "#if", "#ifdef", and "#ifndef" are the same
    // for k==3
    : ( ( PPifndef ) => PPifndef^ | ( PPifdef ) => PPifdef^ | ( PPif ) => PPif^ )
      ppexpr ( ppline )+ ( ppelif )* ( ppelse )? PPendif!
    ;

/*
ppifdef
    // we could use "ID" instead of ppexpr below, but this way it's easier
    // to work with the AST
    : PPifdef^ ppexpr ( ppline )+ ( ppelif )* ( ppelse )? PPendif!
    ;
*/

ppelif
    : PPelif! ppexpr ( ppline )*
      { ## = #( #[NPPelif], ## );}
    ;

ppelse
    : PPelse! ( ppline )*
      { ## = #( #[NPPelse], ## );}
    ;

ppexpr
    : ppAndExpr ( LOR^ ppAndExpr )*
      { ## = #( #[NPPexpr], ## );}
    ;

ppAndExpr
    : ppNotExpr ( LAND^ ppNotExpr )*
    ;

ppNotExpr
    : ( LNOT^ )? ppExprTerminal
    ;

ppExprTerminal
    : ( PPDEFINED ) => PPDEFINED^ LPAREN! ID RPAREN!
    | ID
    | Number
    ;
> - Should the preprocessor parser be a syntactic parser (with
> productions like active/inactive lines, start and end of includes, ...),
> or a lexical parser ?
People generally call them a "lexer" and a "parser", and ANTLR generates
both from a single grammar. The rules above are parser rules. You'll also
need to add these lexer rules:
PPDEFINED : "defined";
PPdefine : "#define" ;
PPif : "#if" ;
PPelse : "#else" ;
PPelif : "#elif" ;
PPendif : "#endif" ;
PPinclude : "#include" ;
PPifdef : "#ifdef" ;
PPifndef : "#ifndef";
and these "imaginary tokens" (the rules above also build NPPdefineArgs and
NPPexpr nodes, so those need declaring as well):
| NPPblock
| NPPline
| NPPelif
| NPPelse
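
The comment in the ppif rule notes that "#if", "#ifdef", and "#ifndef" share their first three characters. As a rough illustration of why that matters when recognizing directive keywords, here is a small Python sketch (not ANTLR code, and not how ANTLR resolves it internally). The directive spellings and token names are copied from the lexer rules above; the function name directive_token and the regex approach are assumptions made for the example. Like the lexer rules above, it does not accept whitespace between "#" and the keyword.

```python
import re

# Directive spellings and token names, copied from the lexer rules above.
DIRECTIVES = {
    "#define": "PPdefine",
    "#if": "PPif",
    "#else": "PPelse",
    "#elif": "PPelif",
    "#endif": "PPendif",
    "#include": "PPinclude",
    "#ifdef": "PPifdef",
    "#ifndef": "PPifndef",
}

# Try longer spellings first so "#ifndef" and "#ifdef" win over "#if",
# and require that the keyword is not followed by another identifier character.
_ALTERNATIVES = "|".join(
    re.escape(s) for s in sorted(DIRECTIVES, key=len, reverse=True)
)
_PATTERN = re.compile("(?:" + _ALTERNATIVES + ")(?![A-Za-z0-9_])")


def directive_token(line):
    """Return the token name of the directive starting `line`, or None.

    Hypothetical helper for illustration only.
    """
    match = _PATTERN.match(line.lstrip())
    return DIRECTIVES[match.group(0)] if match else None


print(directive_token("#ifndef FOO_H"))  # PPifndef
print(directive_token("#if X"))          # PPif
```

Without the longest-first ordering, "#ifdef DEBUG" would be cut after "#if", which is the same ambiguity the lookahead comment describes.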
> - What should this preprocessor parser return ?
> - A list of tokens (with their channel set to hidden / visible) (is
> it possible for a grammatical parser to return a token list) ?
> - A tree structure with the structure of the file ?
> - Something other ?
If you just enhance the C parser to handle preprocessor directives like this,
the preprocessor stuff will just show up as nodes in your C parser's AST.
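
To make "show up as nodes" a bit more concrete, here is a Python sketch of the tree shape that the ppif, ppdefine, ppexpr and ppelse rules above would, as far as I can read them, build for a small #ifdef/#else block. The Node class and both helper functions are hypothetical stand-ins, not ANTLR classes, and the tree is built by hand rather than by a generated parser.

```python
class Node:
    """Toy stand-in for an AST node: token type, optional text, children."""

    def __init__(self, type_, text=None, children=()):
        self.type = type_
        self.text = text
        self.children = list(children)


def to_lisp(node):
    """Render a tree as ( root child ... ); leaves print as text or type."""
    label = node.text if node.text is not None else node.type
    if not node.children:
        return label
    parts = [label] + [to_lisp(child) for child in node.children]
    return "( " + " ".join(parts) + " )"


def directive_nodes(node):
    """Collect nodes whose type is one of the PP* directive tokens."""
    found = [node] if node.type.startswith("PP") else []
    for child in node.children:
        found.extend(directive_nodes(child))
    return found


# Hand-built tree for:
#   #ifdef DEBUG
#   #define LOG 1
#   #else
#   #define LOG 0
#   #endif
# PPendif and PPelse are dropped by "!"; NPPexpr and NPPelse are imaginary roots.
tree = Node("PPifdef", "#ifdef", [
    Node("NPPexpr", children=[Node("ID", "DEBUG")]),
    Node("PPdefine", "#define", [Node("ID", "LOG"), Node("Number", "1")]),
    Node("NPPelse", children=[
        Node("PPdefine", "#define", [Node("ID", "LOG"), Node("Number", "0")]),
    ]),
])

print(to_lisp(tree))
# ( #ifdef ( NPPexpr DEBUG ) ( #define LOG 1 ) ( NPPelse ( #define LOG 0 ) ) )
```

The point is only that the inactive branch is still present in the tree, so a later pass can write it back out unchanged.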
>
> Other questions about the C grammar parser:
>
> In the reference book (The Definitive ANTLR Reference: Building Domain
> Specific languages), I read that an AST should not contain syntax-only
> tokens, like the ';' statement separator, parentheses used to change
> operation precedence ... I do not understand why an AST should not
> contain such tokens. I suppose they are just useless in an AST. Are
> there other reasons ?
No, just that they're useless. Note that the "cgram" ANTLR grammar
does put a lot of these useless tokens in the AST. I usually find it
easier to just ignore them when processing the AST, rather than futzing with
the grammar to stop them from going into the AST.
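
Ignoring syntax-only tokens while processing can be as simple as filtering during the walk. The following Python sketch assumes nodes are plain dicts and that the punctuation token types are called SEMI, LPAREN, RPAREN and COMMA; those names and the function name significant are made up for the example and are not taken from cgram.

```python
# Hypothetical token-type names for punctuation that carries no meaning
# once the tree structure is known.
SYNTAX_ONLY = {"SEMI", "LPAREN", "RPAREN", "COMMA"}


def significant(node):
    """Yield nodes in preorder, skipping syntax-only tokens.

    Syntax-only tokens are leaves, so skipping one never hides real content.
    """
    if node["type"] in SYNTAX_ONLY:
        return
    yield node
    for child in node.get("children", []):
        yield from significant(child)


# Toy tree for "x = (a);" in which the punctuation was left in the AST.
tree = {
    "type": "ASSIGN",
    "children": [
        {"type": "ID", "text": "x"},
        {"type": "LPAREN"},
        {"type": "ID", "text": "a"},
        {"type": "RPAREN"},
        {"type": "SEMI"},
    ],
}

print([n["type"] for n in significant(tree)])  # ['ASSIGN', 'ID', 'ID']
```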
>
> This book is well written, but I'm not sure I can make the
> best choice between AST, Tree, custom made structures ...
Start by using AST, then if/when you start needing features that AST doesn't have,
subclass it.
>
> If the AST is not the right structure for returning the parsed grammar to
> the caller, I suppose I could use custom made structures. But is that the
> best choice ?
AST certainly has the basic tree operations that you'll need, so don't
start from scratch. Subclass AST as needed.
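
As an illustration of "subclass as needed", here is a Python sketch of a node subclass that also remembers the line number and the hidden text (whitespace, comments) that preceded its token, which is the kind of information needed to write a file back out in its original form. BaseNode, SourceNode and their fields are hypothetical; they only mimic the idea of extending a generic AST node class and are not ANTLR's classes.

```python
class BaseNode:
    """Stand-in for a generic AST node with only basic tree operations."""

    def __init__(self, type_, text):
        self.type = type_
        self.text = text
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return self


class SourceNode(BaseNode):
    """Subclass that also records where a token came from and what preceded it."""

    def __init__(self, type_, text, line=0, hidden_before=""):
        super().__init__(type_, text)
        self.line = line
        # Whitespace and comments that the parser never saw.
        self.hidden_before = hidden_before

    def original_text(self):
        """Reproduce the token together with the hidden text before it."""
        return self.hidden_before + self.text


node = SourceNode("ID", "x", line=3, hidden_before="/* counter */ ")
print(node.original_text())  # /* counter */ x
```

Code that only knows about BaseNode keeps working, since a SourceNode still supports the same tree operations.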
>
> I do not understand very well the differences between an abstract tree
> and a concrete tree (I'm really a newbie ...).
> Some hints about these differences are welcome.
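The AST is the essence of the parsed source code, stored in a tree-like
data structure. A concrete tree (parse tree) instead has a node for every
rule and token the parser matched, punctuation included. Don't worry about
it; you won't be using it.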
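
For a rough feel of the difference, the Python sketch below writes out a concrete tree for "(a + b) * c" by hand and reduces it to an AST-like form. The rule names (mulExpr, addExpr, primary), the tuple encoding and the function abstract are all invented for this example; a real parser would build the AST directly rather than convert it.

```python
PUNCTUATION = {"(", ")", ";"}


def abstract(node):
    """Reduce a concrete parse tree to an AST-like nested tuple.

    A concrete node is (rule_name, child, ...); a token is a plain string.
    """
    if isinstance(node, str):
        return node
    _rule, *children = node
    kept = [
        abstract(child)
        for child in children
        if not (isinstance(child, str) and child in PUNCTUATION)
    ]
    if len(kept) == 1:       # rule node that merely wraps one thing
        return kept[0]
    if len(kept) == 3:       # left operand, operator, right operand
        left, op, right = kept
        return (op, left, right)
    return tuple(kept)


# Concrete tree for "(a + b) * c": every rule and every token is present.
concrete = (
    "mulExpr",
    ("primary", "(", ("addExpr", ("primary", "a"), "+", ("primary", "b")), ")"),
    "*",
    ("primary", "c"),
)

print(abstract(concrete))  # ('*', ('+', 'a', 'b'), 'c')
```

The parentheses disappear because the nesting of the tuples already records that the addition happens first.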
>
>
> Thanks for your replies,
>
> Vincent De Groote
Your task is far harder than it looks.
I'm currently working on a thing that just adds "printf" calls
after every assignment in C code. It's quite amazing how
difficult it is, especially after seeing how easy it is
for "C-" in the book.
The outputting of formatted code will be the easy part.
You can either use a treewalker or my "by hand" approach:
http://www.jazillian.com/antlr/emitter.html
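
A minimal sketch of the tree-walking idea, in Python rather than as an ANTLR treewalker: it turns a toy expression AST back into text and re-inserts only the parentheses that the tree structure requires. The tuple encoding (operator, left, right), the precedence table and the function emit are assumptions for the example and are not taken from the emitter page linked above.

```python
# Binding strength of the binary operators handled by this sketch.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def emit(expr, parent_prec=0, right_side=False):
    """Emit an expression AST as text, adding parentheses only where needed.

    An expression is either a leaf string or a tuple (op, left, right).
    All four operators are treated as left-associative.
    """
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    prec = PRECEDENCE[op]
    text = f"{emit(left, prec)} {op} {emit(right, prec, True)}"
    needs_parens = prec < parent_prec or (right_side and prec == parent_prec)
    return f"({text})" if needs_parens else text


print(emit(("*", ("+", "a", "b"), "c")))  # (a + b) * c
print(emit(("-", "a", ("-", "b", "c"))))  # a - (b - c)
print(emit(("+", ("*", "a", "b"), "c")))  # a * b + c
```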
Good luck.
Andy