[antlr-interest] SeeGramWrap: Yet another refactoring
Terence Parr
parrt at cs.usfca.edu
Wed Mar 3 17:54:42 PST 2004
Wow! Great work, Ed! Should we make the text of this an "article" at
antlr.org?
Terence
On Mar 3, 2004, at 5:19 PM, edcjones wrote:
> I have placed a re-refactored version of "SeeGramWrap-03.02.2004.tgz"
> on my webpage at "http://members.tripod.com/~edcjones/pycode.html".
> SeeGramWrap parses a piece of C code and the resulting parse tree is
> output in human- and machine-readable form. The result can be used for
> program transformations. Since a particular transformation algorithm
> may not require all the information present in the tree, the user can
> select what to output.
>
> This program has been written and tested only under Linux.
>
> Thanks to John Mitchell and Monty Zukowski for "cgram.tgz". Every
> parser generator needs to have a good C grammar. Also thanks to
> Terence Parr for ANTLR (http://www.antlr.org/).
>
> ==============================================================
> CONTEXT
>
> Python (http://www.python.org) is a scripting language that is both
> easy to read and easy to write. It is so easy to read that I can
> usually read my own code six months after I write it. But Python is
> slow. If speed is needed for part of a project, Pythonistas write C
> code that calls functions in Python's large API. It is also common to
> wrap large C libraries so they can be called by Python. The wrapping
> code is repetitive and there may be a lot of it, so methods have been
> developed for automated wrapping.
>
> The best-known approach is SWIG (http://www.swig.org/). For complex
> wrappings, SWIG requires the writing of "typemaps", an unintuitive
> process where pieces of C code you write are spliced into the wrapper
> code generated by SWIG.
>
> Another wrapper-related approach is Pyrex, which is found at
>
> http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
>
> Pyrex has its own repetitive boilerplate that has to be written. But
> the Pyrex boilerplate is so straightforward that it can be taught
> algorithmically. See "Michael's Quick Guide to Pyrex" at
> "http://ldots.org/pyrex-guide/".
>
> I think that the Pyrex boilerplate is _so_ straightforward that it can
> be machine generated. Therefore I have been sporadically developing
> software to do this. A thoroughly buggy version of this is on my web
> page, "http://members.tripod.com/~edcjones". It is called
> "cgram.tar.gz" (the name will be changed). Look at it but don't use
> it. "SeeGramWrap" is a major revision of the front end of
> "cgram.tar.gz".
>
> I think the automatic-wrapper program can be made to work. It might be
> easier to use than SWIG. It is still a lot of work to prepare complex
> C header files. What we have is really a "program transformation" or
> "tree transformation" problem.
>
> I think some of the issues are:
>
> 1. Since parser generators have a long and steep learning curve, I
> prefer to use them as black boxes which generate parsers which output
> results that I can analyze using Python. The parser created by a
> parser generator should output trees in two formats: one easy to look
> at and another that a program can easily read. For examples, see below.
>
> 2. I find trees very easy to work with. I want the trees to be front
> and center and highly visible. I prefer to "manipulate a tree" rather
> than "fire a rule".
>
> 3. The most common type of C macro has a type as one of its arguments:
>
> #define CAST(x, type) (type *) x
>
> How can these be automatically wrapped for Python which is a
> dynamically typed language?
>
> =============================================================
> TECHNICAL OVERVIEW
>
> I use some C grammars associated with ANTLR. The grammar package is
> called "cgram". See "http://www.antlr.org/resources.html".
>
> In "cgram" there is a Java program, "TestThrough.java", which parses C
> code into an AST, then runs a tree grammar on the AST and outputs the
> original code. The tree grammar is named "GnuCEmitter.g". I work with
> this grammar because the terminal tokens are printed in the correct
> order. I modified the grammar, turning it into a template. A piece of
> the original "GnuCEmitter.g" is:
>
> ----
> typeQualifier
> : a:"const" { print( a ); }
> | b:"volatile" { print( b ); }
> ;
> ----
>
> The modified version is:
>
> ----
> typeQualifier
> : a:"const" { <@ a @> }
> | b:"volatile" { <@ b @> }
> ;
> ----
>
> In this template, strings of the form "<@ ... @>" will each be
> replaced by a set of print statements. Moreover the entire rule will
> be wrapped by prints. The template is used in
> "emitter/insert_prints.py". If "insert_prints.py" is run the result is:
>
> ----
> typeQualifier
> { if ( inputState.guessing==0 ) {
> print(Open);
> print("typeQualifier");
> }
> }
> : (
> a:"const" { print(Open);
> print("typeQualifier.0"); print( a ); print(Close); }
> | b:"volatile" { print(Open);
> print("typeQualifier.1"); print( b ); print(Close); }
> )
> { currentOutput.print(Close + MyTokenSep); }
> ;
> ----
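
The marker expansion described above can be sketched in Python. This is
not the code in "emitter/insert_prints.py", which is not shown in this
message: the name expand_markers and the regular expression are
assumptions made for illustration. The sketch handles only the
per-alternative "<@ ... @>" markers, not the prints that wrap the entire
rule or the guessing check.

```python
import re

# Matches one template marker such as "<@ a @>" and captures the label.
MARKER = re.compile(r"<@\s*(\w+)\s*@>")


def expand_markers(rule_name, template):
    """Replace each <@ label @> with four print statements.

    Hypothetical helper: markers are numbered in order of appearance,
    mirroring the "typeQualifier.0" and "typeQualifier.1" names in the
    generated grammar shown above.
    """
    counter = 0

    def replace(match):
        nonlocal counter
        label = match.group(1)
        text = (
            f'print(Open); print("{rule_name}.{counter}"); '
            f"print( {label} ); print(Close);"
        )
        counter += 1
        return text

    return MARKER.sub(replace, template)


template = '''typeQualifier
    : a:"const" { <@ a @> }
    | b:"volatile" { <@ b @> }
    ;'''

print(expand_markers("typeQualifier", template))
```

Numbering by order of appearance works here because each alternative
holds exactly one marker; a rule with several markers per alternative
would need a real notion of alternatives.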
>
> If the original C program, "temp2.c", is
>
> char* s = "ab";
>
> The output of the modified emitter grammar is "temp2.c.data":
>
> ----
> <<OPEN>> <<OPEN>> <<OPEN>>
> externalList declarator expr
> <<OPEN>> <<OPEN>> <<OPEN>>
> externalDef pointerGroup primaryExpr
> <<OPEN>> <<OPEN>> <<OPEN>>
> declaration pointerGroup.0 stringConst
> <<OPEN>> * <<OPEN>>
> declSpecifiers <<CLOSE>> stringConst.0
> <<OPEN>> <<CLOSE>> "ab"
> typeSpecifier <<OPEN>> <<CLOSE>>
> <<OPEN>> declarator.0 <<CLOSE>>
> typeSpecifier.1 s <<CLOSE>>
> char <<CLOSE>> <<CLOSE>>
> <<CLOSE>> <<CLOSE>> <<CLOSE>>
> <<CLOSE>> <<OPEN>> <<CLOSE>>
> <<CLOSE>> initDecl.0 <<CLOSE>>
> <<OPEN>> = ;
> initDeclList <<CLOSE>> <<CLOSE>>
> <<OPEN>> <<OPEN>> <<CLOSE>>
> initDecl initializer <<CLOSE>>
> ----
>
> This output can be processed by "tree.py" to produce "temp2.c.nest":
>
> ----
> (externalList,
> (externalDef,
> (declaration,
> (declSpecifiers,
> (typeSpecifier,
> (typeSpecifier.1, |char|))),
> (initDeclList,
> (initDecl,
> (declarator,
> (pointerGroup,
> (pointerGroup.0, |*|)),
> (declarator.0, |s|)),
> (initDecl.0, |=|),
> (initializer,
> (expr,
> (primaryExpr,
> (stringConst,
> (stringConst.0, |"ab"|))))))), |;|)))
> ----
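
The step from the ".data" stream to a nested tree can be sketched with a
stack. This is a guess at the approach, not the contents of "tree.py".
It assumes the three columns shown above are read top to bottom, left to
right, and that the token following each <<OPEN>> is the rule name. The
name build_tree is hypothetical.

```python
OPEN, CLOSE = "<<OPEN>>", "<<CLOSE>>"


def build_tree(tokens):
    """Turn a flat OPEN/CLOSE token stream into nested tuples.

    Each node is (name, child, child, ...); a child is either another
    node (a tuple) or a terminal token (a str).
    """
    stack = [[]]                      # sentinel list collects top-level nodes
    for tok in tokens:
        if tok == OPEN:
            stack.append([])          # start a new node
        elif tok == CLOSE:
            if len(stack) == 1:
                raise ValueError("unbalanced <<CLOSE>>")
            node = tuple(stack.pop())
            stack[-1].append(node)    # attach the finished node to its parent
        else:
            stack[-1].append(tok)     # rule name or terminal
    if len(stack) != 1:
        raise ValueError("unbalanced <<OPEN>>")
    return stack[0]


# The declarator subtree from the middle column of "temp2.c.data".
tokens = [
    OPEN, "declarator",
    OPEN, "pointerGroup", OPEN, "pointerGroup.0", "*", CLOSE, CLOSE,
    OPEN, "declarator.0", "s", CLOSE,
    CLOSE,
]
tree = build_tree(tokens)[0]
print(tree)
```

Tuples are used for nodes so that a terminal (str) and a subtree (tuple)
can be told apart with a simple type check.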
>
> or "temp2.c.src":
>
> char * s = "ab" ;
>
> If "temp2.c.src" is itself put through the entire process, we get
> "temp2.c.src.src", which is identical to "temp2.c.src". This test is
> done by "docheck.py".
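
The fixed-point property can be illustrated with a toy stand-in for the
pipeline. The regex tokenizer below is an assumption for illustration
only and is not a C lexer; the real check in "docheck.py" presumably
runs the ANTLR-generated parser and emitter. The names to_src and
is_fixed_point are hypothetical.

```python
import re

# Toy tokenizer: string literals, then identifiers/numbers, then any
# single punctuation character. A stand-in for the real parser.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\w+|[^\s\w]')


def to_src(code):
    """Re-emit code with exactly one space between tokens."""
    return " ".join(TOKEN.findall(code))


def is_fixed_point(code):
    """True if a second pass over the output changes nothing."""
    once = to_src(code)
    return to_src(once) == once


original = 'char* s = "ab";'
print(to_src(original))
print(is_fixed_point(original))
```

The first pass may change spacing; the point of the check is only that
the second pass changes nothing further.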
>
> In the ".data" or ".nest" files the tokens from the original C code
> are in the correct order. It is easy to recover the token sequence:
>
> ('char', '*', 's', '=', '"ab"', ';')
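
One way to recover that sequence is a depth-first walk that skips rule
names. The sketch assumes the ".nest" tree has been transcribed into
Python tuples of the form (name, child, ...); the function name
terminals is hypothetical.

```python
def terminals(node):
    """Yield the terminal tokens of a (name, child, ...) tree in source order."""
    for child in node[1:]:            # node[0] is the rule name, skip it
        if isinstance(child, tuple):
            yield from terminals(child)
        else:
            yield child


# "temp2.c.nest" transcribed as Python tuples.
tree = (
    "externalList",
    ("externalDef",
     ("declaration",
      ("declSpecifiers",
       ("typeSpecifier", ("typeSpecifier.1", "char"))),
      ("initDeclList",
       ("initDecl",
        ("declarator",
         ("pointerGroup", ("pointerGroup.0", "*")),
         ("declarator.0", "s")),
        ("initDecl.0", "="),
        ("initializer",
         ("expr",
          ("primaryExpr",
           ("stringConst", ("stringConst.0", '"ab"'))))))),
      ";")),
)

print(tuple(terminals(tree)))
```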
>
> Thanks,
> Ed Jones
--
Professor Comp. Sci., University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing
Yahoo! Groups Links
<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/antlr-interest/
<*> To unsubscribe from this group, send an email to:
antlr-interest-unsubscribe at yahoogroups.com
<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/