[antlr-interest] SeeGramWrap: Yet another refactoring

Wed Mar 3 17:19:56 PST 2004

I have placed a re-refactored version of "SeeGramWrap-03.02.2004.tgz"
on my webpage at "http://members.tripod.com/~edcjones/pycode.html".
SeeGramWrap parses a piece of C code and the resulting parse tree is
output in man and machine readable form. The result can be used for
program transformations. Since a particular trnsformation algorithm
may not require all the information present in the tree, the user can
select what to output.

This program has been written and tested only under linux.

Thanks to John Mitchell and Monty Zukowski for "cgram.tgz". Every
parser generator need to have a good C grammar. Also thanks to
Terrence Parr for ANTLR (http://www.antlr.org/).

==============================================================
                          CONTEXT

Python (http://www.python.org) is a scripting language that is both
easy to read and easy to write. It is so easy to read that I can
usually read my own code six months after I write it. But Python is
slow. If speed is needed for part of a project, Pythonistas write C
code that call functions in Python's large API. It also common to wrap
large C libraries so they can be called by Python. The wrapping code
is repetitive and there may be a lot of it so methods have been
developed for automated wrapping.

The best-known approach is SWIG (http://www.swig.org/). For complex
wrappings, SWIG requires the writing of "typemaps", an unintuitive
process where pieces of C code you write are spliced into the wrapper
code generated by SWIG.

Another wrapper related approach is Pyrex which is found at

    http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/

Pyrex has its own repetitive boilerplate that has to be written. But
the Pyrex boilerplate is so straightforward that it can be taught
algorithmically. See "Michael's Quick Guide to Pyrex" at
"http://ldots.org/pyrex-guide/".

I think that the Pyrex boilerplate is _so_ straightforward that it can
be machine generated. Therefore I have been sporatically developing
software to do this. A thoroughly buggy version of this is on my web
page, "http://members.tripod.com/~edcjones". It is called
"cgram.tar.gz" (The name will be changed). Look at it but don't use
it. "SeeGramWrap" is a major revision of the front end of "cgram.tr.gz".

I think the automatic-wrapper program can be made to work. It might be
easier to use than SWIG. It is still a lot of work to prepare complex
C header files. What we have is really a "program transformation" or
"tree transformation" problem.

I think some of the issues are:

1. Since parser generators have a long and steep learning curve, I
prefer to use them as black boxes which generate parsers which output
results that I can analyze using Python. The parser created by a
parser generator should output trees in two formats: one easy to look
at and another that a program can easily read. For examples, see below.

2. I find trees very easy to work with. I want the trees to be front
and center and highly visible. I prefer to "manipulate a tree" rather
than "fire a rule".

3. The most common type of C macro has a type as one of its arguments:

    #define CAST(x, type) (type *) x

How can these be automatically wrapped for Python which is a
dynamically typed language?

=============================================================
                    TECHNICAL OVERVIEW

I use some C grammars associated with ANTLR. The grammar package is
called "cgram". See "http://www.antlr.org/resources.html".

In "cgram" there is a java program "TestThrough.java" which parses C
code into an AST then runs a tree grammar on the AST and outputs the
original code. The tree grammar is named "GnuCEmitter.g". I work with
this grammar because the terminal tokens are printed in the correct
order. I modified the grammar turning it into a template. A piece of
the original "GnuCEmitter.g" is:

----
typeQualifier
        :       a:"const"                       { print( a ); }
        |       b:"volatile"                    { print( b ); }
        ;
----

The modified version is:

----
typeQualifier
        :       a:"const"                       { <@ a @> }
        |       b:"volatile"                    { <@ b @> }
        ;
----

In this template, strings of the form "<@ ... @>" will each be
replaced by a set of print statements. Moreover the entire rule will
be wrapped by prints.  The template is used in
"emitter/insert_prints.py". If "insert_prints.py" is run the result is:

----
typeQualifier
  { if ( inputState.guessing==0 ) {
          print(Open);
          print("typeQualifier");
       }
    }
        :  (
                a:"const"        {  print(Open);
print("typeQualifier.0"); print( a ); print(Close); }
        |       b:"volatile"     {  print(Open);
print("typeQualifier.1"); print( b ); print(Close); }
           )
  { currentOutput.print(Close + MyTokenSep); }
        ;
----

If the original C program , "temp2.c", is

    char* s = "ab";

The output of the modified emitter grammar is "temp2.c.data":

----
    <<OPEN>>                 <<OPEN>>                 <<OPEN>>
    externalList             declarator               expr
    <<OPEN>>                 <<OPEN>>                 <<OPEN>>
    externalDef              pointerGroup             primaryExpr
    <<OPEN>>                 <<OPEN>>                 <<OPEN>>
    declaration              pointerGroup.0           stringConst
    <<OPEN>>                 *                        <<OPEN>>
    declSpecifiers           <<CLOSE>>                stringConst.0
    <<OPEN>>                 <<CLOSE>>                "ab"
    typeSpecifier            <<OPEN>>                 <<CLOSE>>
    <<OPEN>>                 declarator.0             <<CLOSE>>
    typeSpecifier.1          s                        <<CLOSE>>
    char                     <<CLOSE>>                <<CLOSE>>
    <<CLOSE>>                <<CLOSE>>                <<CLOSE>>
    <<CLOSE>>                <<OPEN>>                 <<CLOSE>>
    <<CLOSE>>                initDecl.0               <<CLOSE>>
    <<OPEN>>                 =                        ;
    initDeclList             <<CLOSE>>                <<CLOSE>>
    <<OPEN>>                 <<OPEN>>                 <<CLOSE>>
    initDecl                 initializer              <<CLOSE>>
----

This output can be processed by "tree.py" to produce "temp2.c.nest"

----
(externalList,
  (externalDef,
    (declaration,
      (declSpecifiers,
        (typeSpecifier,
          (typeSpecifier.1, |char|))),
      (initDeclList,
        (initDecl,
          (declarator,
            (pointerGroup,
              (pointerGroup.0, |*|)),
            (declarator.0, |s|)),
          (initDecl.0, |=|),
          (initializer,
            (expr,
              (primaryExpr,
                (stringConst,
                  (stringConst.0, |"ab"|))))))), |;|)))
----

or "temp2.c.src":

    char * s = "ab" ;

If "temp2.c.src" is put through the entire process itself we get
"temp2.c.src.src" which is identical to "temp2.c.src". This test is
done by "docheck.py".

In the ".data" or ".nest" files the tokens from the original C code
are in the correct order. It is easy to recover

    ('char', '*', 's', '=', '"ab"', ';')

Thanks,
Ed Jones

Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/