[antlr-interest] Jazillian C to Java translator

Andy Tripp atripp at jazillian.com
Fri Dec 16 06:15:47 PST 2005


Hi guys,

Thanks for your interest in Jazillian.
This page: http://www.jazillian.com/does.html gives a flavor for what it 
does, at a high level.
This page: http://www.jazillian.com/how.html gives a fairly complete 
list of the various transformation
rules.

As for a design/implementation document, I don't have a complete one yet 
so I'll try to explain a bit here.
I apply a series of about 150 transformation rules. Each rule consists 
of a match() method that
does a pattern match, and an apply() with does the transformation. For 
example, the PrintfRule.match()
method just looks for a "printf" token, and the PrintfRule.apply() 
method replaces all the tokens in the
printf statement with tokens for an equivalent Java System.out.println() 
call.

I have a full class hierarchy of rules. At the top are various abstract 
classes, but also things like
FileRenameRule, which just renames files to match Java naming 
conventions. At the bottom of the
rule hierarchy are things like SnippetsRule, which reads a file 
containing simple text transformation rules like:

NULL --> 0   // replace NULL with 0
unsigned -->         // Java has no unsigned keyword, just delete it
v1 = sizeof v2 / sizeof (char *); --> v1 = v2.length;  // just one of 
many ways to get string length in C

Note that the pattern matcher allows placeholders such as "v1" and "v2" 
above, for which any variable
can be matched. There is also an "x" placeholder which will match anything:
if (false) { x } -->      // just delete any code that will never get used.

In addition to the 150 rules written in Java, there are hundreds of 
these simple pattern-match-and-replace rules.
All of this is done with streams of tokens on a set of C files. I have 
built a library of various functions for
working with these token streams for:

matching patterns and doing replacements, inserts, deletes, moves
symbol table stuff: finding variable declarations, types, references, etc.
All the stuff I would have gotten for free if I where doing the 
traditional tree-transform approach: function scopes,
 start and end of variables and statements, etc

I did pay a heavy price to implement my transformations on token streams 
rather than ASTs, but I am still happy
with the result. For example, I have an elaborate isDeclaration() method 
that tries to figure out if a given token is
a variable or function declaration by examining nearby tokens. The 
payoff is that I can simply write hundreds of
rules like the ones above without having to think about ASTs.

My GotoRemoverRule is just an implementation of one of the well-known 
goto removal algorithms, adopted to use
my library of token-stream functions. I handle preprocessor stuff by 
cheating: I've added rules to Monty's
cgram grammar to handle cpp constructs, but only when they appear where 
people usually put them. In this way,
I can be smart and make the same changes a human would do (as opposed to 
forcing a user to run cpp before running
Jazillian). So I can make changes like:
#define X 1    becomes   public final static int X = 1;
#define f(x) x+1   becomes  static int f(int x) { return x+1}

(This sharp crowd is now shaking their collective head. "How do you even 
know that x is an int? This
macro works on any type!" Answer: I looked at every reference of x. "But 
what if you weren't so lucky
and  you need a "float" version of f? Answer: I'm doing the best I can. 
I either handle that today, or
will get to it soon :)

C structs, unions, and enums become their own Java classes. I generate 
Java 1.4 code - not generating
any 1.5 stuff (like enums) yet. I do no magical "make it object 
oriented" tricks: each C file becomes a Java
class. And I generate the code that a human would generate. For example:
printf("i=%d  s=%s\n");   becomes System.out.println("i=" + i + " s=" + s);
(i.e. I see that you're ending with a newline, and so calling println. I 
will never generate a Java printf call.
That would work, but it's not what a real person would do).

Pointers are the toughest issue. I basically analyze all pointer usage, 
looking for the three most common
usages of pointers:
As an array index (in which case, I change the pointer to a Java int, 
and make the pointee an array
As some sort of pointer-based data structure (linked list, tree, 
whatever), which requires just minor syntax changes
As a way of doing pass-by-reference

I use ANTLR for all C lexing. The only place I use ANTLR for parsing and 
AST generation is for handling
expressions. Java's boolean type is required in certain places, where C 
programs use an int because
there's no built-in boolean type. So when we have "if (x)", I really 
have no choice other than to go ahead
and build an AST from "x" (which could be a large, complex expression), 
and then walk thru the AST
replacing various "int" nodes with "boolean" ones.

I guess that's it off the top of my head. I'll be happy to give more 
specifics if you have questions.
I'm currently working on a C++ version. I'm proud still have much of my 
sanity left after trying to map
templates to generics. That's not to say that I'm ever going to get it 
to work, but I see keeping my sanity
as a big plus :)

Andy


More information about the antlr-interest mailing list