[antlr-interest] How to do preprocessing in antlr v4?

Sun Nov 18 06:51:27 PST 2012

Hello Martin,

I suppose you want to parse a C program to extract information from macros.
So you have to write a grammar that recognizes C preprocessor #define and
ignores all other lines.

> It looks to me that preprocessor does almost the same work as the lexer.

The C preprocessor certainly has to tokenize its input (the lexer's job) and
then interpret the tokens to perform replacements, which is parsing work.
By contrast, an ANTLR lexer only follows your lexer rules to tokenize the
input. For example, given the parser rule:

if_line
    :   '#' 'if' constant_expression
    |   '#' 'ifdef'  ID
    |   '#' 'ifndef' ID
    ;

and the input

#ifndef HAVE_STDLIB_H
char *getenv();
#endif

the generated lexer will create the following tokens: T1='#', T2='ifndef',
T3=ID='HAVE_STDLIB_H', etc. The generated parser will then match these tokens
against the third alternative of rule if_line.
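
To make the tokenizing step concrete, here is a minimal hand-written sketch in
plain Java. It is not the code ANTLR generates; the class name DirectiveTokens
and the printed token format are made up for illustration. It only shows how
the #ifndef line above splits into '#', a keyword, and an ID, with whitespace
skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: splits one directive line into tokens.
class DirectiveTokens {
    // Group 1: '#', group 2: identifier-shaped text, group 3: blanks and tabs.
    private static final Pattern TOKEN =
        Pattern.compile("(#)|([A-Za-z_][A-Za-z_0-9]*)|([ \\t]+)");

    // Words that the if_line rule uses as literals.
    private static final Set<String> KEYWORDS = Set.of("if", "ifdef", "ifndef");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        int pos = 0;
        while (pos < line.length()) {
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            if (m.group(1) != null) {
                tokens.add("'#'");
            } else if (m.group(2) != null) {
                String text = m.group(2);
                // A keyword wins over a plain identifier with the same spelling.
                tokens.add(KEYWORDS.contains(text) ? "'" + text + "'" : "ID(" + text + ")");
            }
            // Group 3 (whitespace) is dropped, much like a hidden-channel token.
            pos = m.end();
        }
        return tokens;
    }
}
```

Calling DirectiveTokens.tokenize("#ifndef HAVE_STDLIB_H") yields the three
tokens '#', 'ifndef' and ID(HAVE_STDLIB_H), which is the sequence the third
alternative of if_line expects.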

I have just written a C grammar, but without a preprocessor (I use gcc -E to
preprocess). As an exercise, I have written the grammar below, tested only
on a 1,200-line program, which you can use as a starting point. Refine
pp_define and token_sequence depending on what you want to capture in
macros.
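
As an illustration of what "capturing" a macro could mean, here is a small
plain-Java sketch that pulls the name, the optional parameter list, and the
replacement text out of one #define line. It mirrors the two alternatives of
pp_define but uses a regular expression instead of the grammar. The class
MacroDefinition and its fields are hypothetical names, and the sketch assumes
the whole directive sits on one line (no backslash continuations).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class: what one might want to capture from a #define.
class MacroDefinition {
    final String name;
    final List<String> params;   // null for an object-like macro
    final String body;           // replacement text, trimmed

    MacroDefinition(String name, List<String> params, String body) {
        this.name = name;
        this.params = params;
        this.body = body;
    }

    // In C, '(' starts a parameter list only when it directly follows the
    // macro name, so no whitespace is allowed between group 1 and group 2.
    private static final Pattern DEFINE = Pattern.compile(
        "^\\s*#\\s*define\\s+([A-Za-z_][A-Za-z_0-9]*)(\\(([^)]*)\\))?\\s*(.*)$");

    // Returns null when the line is not a #define directive.
    static MacroDefinition parse(String line) {
        Matcher m = DEFINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> params = null;
        if (m.group(2) != null) {            // function-like macro
            params = new ArrayList<>();
            for (String p : m.group(3).split(",")) {
                if (!p.trim().isEmpty()) {
                    params.add(p.trim());
                }
            }
        }
        return new MacroDefinition(m.group(1), params, m.group(4).trim());
    }
}
```

For "#define MAX(a, b) ((a) > (b) ? (a) : (b))" this gives name MAX,
parameters a and b, and the parenthesized expression as body; for
"#define BUFSIZE 1024" the parameter list is null. In the grammar, the same
information would come from the ID tokens and the token_sequence subtree of
pp_define.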

Having said that, if you need to do some preprocessing before parsing
the C program, you will need a full C grammar combined with a full C
preprocessor, and perhaps the TokenStreamRewriter feature described in the
sections "Rewriting the Input Stream" on page 54 and "Accessing Hidden
Channels" on page 208 of the beta 3 book.
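
To give an idea of the lexer -> preprocessor -> parser arrangement, here is a
minimal plain-Java sketch of the middle stage. It works on a list of token
texts rather than on ANTLR token streams, handles only object-like macros (no
parameters), and is nowhere near a full C preprocessor. The class
TokenPreprocessor is a hypothetical name, not an ANTLR API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stage between a lexer and a parser: replaces macro names
// in a token list by their replacement tokens.
class TokenPreprocessor {
    private final Map<String, List<String>> macros;   // name -> replacement tokens

    TokenPreprocessor(Map<String, List<String>> macros) {
        this.macros = macros;
    }

    List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        expandInto(tokens, new HashSet<>(), out);
        return out;
    }

    // 'active' holds the macros currently being expanded, so a macro that
    // mentions itself is copied through instead of recursing forever.
    private void expandInto(List<String> tokens, Set<String> active, List<String> out) {
        for (String tok : tokens) {
            List<String> body = macros.get(tok);
            if (body == null || active.contains(tok)) {
                out.add(tok);
            } else {
                active.add(tok);
                expandInto(body, active, out);   // rescan the replacement
                active.remove(tok);
            }
        }
    }
}
```

With BUFSIZE defined as 1024 and LIMIT as ( BUFSIZE * 2 ), the token list
int n = LIMIT ; expands to int n = ( 1024 * 2 ) ; before it would reach the
parser. Macro parameters would need an extra step that collects the argument
tokens and substitutes them into the body.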

HTH
Bernard

grammar Cmacros;

/* Process #define statements in a C file.
   TODO : develop token_sequence
*/

program : translation_unit ;

translation_unit
@init {System.out.println("Cmacros last update 1436");}
    :   (   '#' preprocessor
        |   ignore
        |   NL
        )+
    ;

preprocessor
    :   pp_define
    |   pp_ignore
    ;

pp_define
    :   'define' ID '(' ID ( ',' ID )* ')' token_sequence
    |   'define' ID token_sequence
    ;

pp_ignore
    :   ignore
    ;

token_sequence
    :   ignore
    ;

ignore
    :   ~NL+ NL
    ;

CHAR
    :   '\'' ( '\\'? . )+? '\'' ;

COMMENT
    :    '/*' .*? '*/' -> channel(HIDDEN)
    ;

HEXADECIMAL
    :   '0' [xX] [0-9a-fA-F]+
    ;

ID  :   ( ID_FIRST (ID_FIRST | DIGIT)* )
    ;

INT :   DIGIT+ ;

//NL  :   '\r'? '\n' -> channel(WHITESPACE) ;  // channel(1)
//NL  :   '\n' -> channel(HIDDEN) ;
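NL  :   '\r'? '\n' ;   // also accept Windows line endings

// Stop before the newline: the parser rule ignore needs the NL token
// to know where a line ending in a // comment stops.
SL_COMMENT
    :   '//' ~[\r\n]* -> channel(HIDDEN)
    ;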

SPECIAL
    :   '+' | '-' | '*' | '/' | '%' | '&' | '|' | '(' | ')' | '{' | '}'
    |   '[' | ']' | '^' | '!' | '<' | '>' | '=' | ',' | '.' | ';' | ':' | '?'
    ;

STRING
    :   '"' ( '\\'? . )*? '"' ;

WS  :   [ \t]+ -> channel(HIDDEN) ;

fragment DIGIT  : [0-9] ;

fragment ID_FIRST : LETTER | '_' ;

fragment LETTER : [a-zA-Z] ;

2012/11/18 Martin d'Anjou <martin.danjou14 at gmail.com>

> ...

> What is the right approach to implement preprocessor directives ... in
> Antlr v4?  ...
> Macro parameters are the reason why I want to tokenize the input to the
> preprocessor. So I want lexer -> preprocessor -> parser.