[antlr-interest] Re: Translators Should Use Tree Grammars

Tue Nov 23 19:18:21 PST 2004

> ASPA translates core libraries too. The built-in functions and classes
> of Jscript and VBScript and ActiveX components are supported.
> This should be obvious from the prior post containing the steps
> to translate someString.length into strlen($someString).

I looked again and I don't see anything that it's doing that's
complex.
changing "someString.length" to "strlen(#someString)" is
pretty simple. I realize you support more complex stuff,
so could you give me an example?

One example that I do:
printf("i=%d c=%c\n", i, c);
...becomes...
System.out.println("i=" + i + " c=" + c);
So I parse the format string, check for "\n" at the end, 
and replace the various placeholders ("%d" and "%c") with
arguments (i and c), using the "+" operator. How would I
specify that (or something else that complicated) using
your system?

> The example I chose was deliberately simple.
> > 
> > Just to pick one example, many C functions return error codes.
> > For example, fopen() returns a 0 if it can't open a file.
> > That needs to be replaced with exception handling in Java.
> > So I have a list of the functions and the values they return
> > on error. I check for calls to the functions, and look for
> > various patterns of error checking, such as:
> > -----------------------------
> > if (fopen(xxx) == 0) { // return value checked
> > // error code
> > }
> > else {
> > // non-error code
> > }
> > -----------------------------
> > v = fopen(xxx); // return value stored and checked later
> > ...
> > if (v != 0) {
> > // non-error code
> > }
> > -----------------------------
> This is in my TODO list for ASPA.
> I had limited time for the development (was an assignment
> from the university) and didn't consider this issue
> as top priority.
> I have some ideas of how to implement this, but will check
> the way you handled this problem for inspiration.

The question that this thread poses, though, is 
"will you be using a treewalker to implement this?"
I doubt that ANTLR's treewalker matching could handle it
(at least not in a cleaner way than writing it by hand).

> > syntactic rules as you do, but I also have many complex 
> > rules like this one.
> > 
> > So, now that I think about it,
> > maybe even this one rule involves several things that you probably
> > wouldn't see in your typical language-to-language translator:
> > 
> > * handling of library functions, not just core language
> This is done in ASPA.

Hmm...anyone have a feel for the size of the
ASP libraries vs. the C libraries
(http://www.gnu.org/software/libc/manual/html_node/)?

> > * replacing whole mechanisms/paradigms (error codes from 
> >   library functions being replaced by exception handling)
> No
> > * complex pattern matching (e.g. checking for various comparisons
> >   the return value like ==, !=, <, etc. and even checking for
> >   storage in a variable and then usage of that variable)
> I don't understand the point about the operators.
> Operators which are handled differently then others in ASPA
> are:
> JScript: The '+' operator. If the operands are numbers the translation
> is '+', otherwise the operator is considered to perform string
concatenation.
> VB: The logical operators (and, or, not ...) if have int operands
are thought
> to perform bitwise operations, and are handled properly.

My point was that when looking for various patterns of
checking for error conditions in C code, you've
got to also check for things like
if (fopen < 0)
if (fopen != 0)

...and not just
if (fopen == 0).
that's all.

> > 
> > In case you think that this rule is just an exceptionally
> > complex one, here are a few other examples:
> > 
> > * structs, unions, and enums become whole Java classes, including
> >   constructors and changes at each reference
> The same is true for jscript classes. The functions are placed inside
> the class body, if the variables(members) are defined and assigned a
value
> in the PHP class only the declaration exists and the members are
initialized
> inside the constructor body, etc 

Can you give me an example of where in your code you produce
constructor code?

> > * memory management is done "by hand" in C must be changed to
> >   use Java objects.
> This is a problem specific to your project. A complex problem
> indeed.
> > * I handle multiple input files, and change C file names
> >   to Java ones (including combining "hello.c" and "hello.h" into
> >   "Hello.java"
> This is true for ASPA too. Other files are included in a file
> with the #include directive.

And I assume that variables declared in one file can be referenced
in another? Do you put all files in one large AST? If not,
how do you handle moving things from one AST to another?

> Also, because ASP contains two languages, there exists a mechanism
> to share information among different parsers and TreePasers, but this
> is specific to my application. 
> > * There are different rules in Java and C for where an array
> >   can be initialized.
> > * The syntax and semantics for array declaration are different
> >   (In C, it's "struct person a[3];", in Java it's 
> >    "Person a[] = new Person[3];" plus a loop to initialize it)
> None of this problems are unachievable with a TreeParser in my opinion.

Sure, but what I'm trying to do is to bring up the things that
I do that don't seem (to me) like they'd be easy using a
treewalker, and asking you (or anyone else) to explain how
you use a treewalker to implement them. You mentioned
that you use a treewalker, so I'm still struggling to
understand how you use it for anything nontrivial.

> A similar issue about arrays exists in VB.
> Example:
> dim a(12, 3) 'multi dimension array
> a(0, 1) = 0
> a(0, 2) = 2
> PHP:
> $a = array(array()); //emulate multi dimension arrays with nested ones
> $a[0][0] = 1;
> $a[0][1] = 2;
> I give this example to illustrate that the 2 problems you refer
> are application specific, but a similar problem can be solved
> with the TreeParser approach.

I don't see how that problem is similar. Isn't the VB-to-PHP
translation that you show here a trivial syntax change?
(or am I missing something, like the fact that in VB you're
setting element (0,1) to 0 and in PHP, you're setting element
(0,) to 1).

I do stuff like if I see a two-dimensional array of chars,
check all the usages to see if it should be changed to
a one-dimensional array of Strings. Or check if a one
dimensional array of chars should just be a String.
I then look for all kinds of gyrations that C programmers
do on char arrays (putting in '\0', searching it, etc) and
replace them with calls to String methods.

> > Now I'm really starting to wonder about how much all the
> > language-to-language translators out there are really doing.
> > I know for a fact that the C-to-Java ones (other than Jazillian)
> > are only doing trivial syntactic changes
> > (see http://jazillian.com/competition.html for details).
> It should feel good to know that you have built the best tool
> available for a problem.
> 
> What this thread should make clear is that there are many
> approaches to solve the same domain of problems. 

I'm sorry to be so hardheaded and I mean no disrespect, but
I need to understand more about what you're doing to
convince myself that we really are dealing with the same domain.

If you could just briefly describe, say, your most complex
transformation, and point me to the right place in your
code so I can investigate more, I'd appreciate it.

> I consider
> your work important because offerers an alternative methodology
> for solving the translation problem. But personally I prefer
> the TreeParser methodology. You point that there are 800 lines
> of code fired by the treewalker. I don't know, because never counted
> them. Anyway the aid the treewalker offered for me was that it would
> fire the appropriate method and I didn't have to worry about it.
> For example in the rule:
> #(DOT (d1:expression | dt:type | dthis:THIS) d2:expression <dot>)
> the code:
>     if (#d1 != null) { //expression.expression
>         #expression = resolveDot(#d1, #d2);
>     } else if (#dt != null) { //Array.prototype
>         #expression = resolveClassAttribute(#dt, #d2);
>     } else { //this.something
>         #expression = resolveObjectAttribute(#d2);
>     }
> is fired.

Hmm...so you're saying "here are all the transformations
that need to be done whenever I encounter a DOT inside an
expression". Boy, it really is hard for me to think about
things that way, I'm so used to my other way. Let me
see if I can think of some of the situations that I
handle that deal with a DOT inside an expression.

Here's one: in C, a field can be a function pointer, so you
can have a function call: "a.f()" (the syntax is far
more complicated, but if I have to write it down,
I'll throw up...but you get the idea). 
Java doesn't have function pointers,
so I check for various function-pointer patterns
 and replace all the function pointer
fields and use Java reflection in its place. It's all very
involved. 

Here's another usage of DOT: one rule might replace a sort
by a call to "Collections.sort(a)". I have a rule that
looks for usage of classes (such as Collections) that
require Java "import" statements. Do I really want to do
a check at the DOT node that says "if the left side
is the name of a class, then import that class"?
That logic has nothing at all to do with the
function-pointer logic - why should they be in
the same place in the code? Just because they both
happen to involve a DOT?

I would
guess there are a few more situations where I have to
do various transformations involving DOT. If so, I'd
have to add even more unrelated cases to this one "DOT" place in
the code. That's slicing the problem the wrong way.

> This is very important because from the grammar definition
> I can be sure that only those 3 cases can occur.

Yea, I can appreciate that. You're sure you've handled
every possible input. But on the other hand, you're not
at all sure that you've handled all the cases.
For example, you surely have one place in the grammar
that handles the "+" binary operator. You know you
have things covered by covering that one case with, say,
a call to a handleBinaryPlus() method. But what does that
method do? Does it:
* remove redundant zeros (x+0 and 0+x become x)
* simplify expressions (x + -1 becomes x - 1)
* record the fact that each operand is involved in an arithmatic
  operation (and thus better not get it's type changed to boolean)
* combine consecutive string concatenation where possible ("a" + "b" 
   becomes just "ab")

I'm just making these examples up, but you see my point: that
many nodes in the grammar will have multiple, unrelated,
"actions" that need to be fired.

Isn't it more natural to have a separate rule for each of the
above items? That way, 1) we avoid having this handleBinaryPlus()
method performing 4 completely unrelated functions, and 2)
we avoid having the "change x from int to boolean" logic
split across handleBinaryPlus() and other functions.

I know it sounds weird to say "I'm not too concerned
that I cover all the cases of possible input". But picture
a natural language processing engine. Do they even care that
they've got code to fire on each possible input grammar
(noun-verb-object, adj-noun-verb-object, and 10,000 others)?
No. They're too busy writing zillions of rules to handle
all the real-world cases. ("Great, you have a rule that
replaces adj-noun in English to noun-adj in Spanish, now
you're 1% of the way done, start writing exception rules,
starting with "The White House").

>  Additionally
> it was simple to model the methods which should offer the functionality
> required for each case. I did not have to worry about how the
expressions
> d1 and d2 should be translated because the treewalker had previously
fired
> the appropriate code sections to take care of that.

I just don't see the advantage of this "fire a rule at each
node" approach. As I look through my rules, almost none
of them involve a single node in the tree. And even when 
rules do involve a single node, I don't want to mix
them together. For example, one rule removes the "u"
in "123u" (java doesn't have unsigned types). And another
removes the "L" in "123L" (because a C long is [usually]
an Java int). Yes, I can have one handleNumber() method
fire at the NUMBER node that does both of these. But
I'd rather not slice the problem that way. Instead, I'd
like the NumberWithURule to traverse the AST and make its
changes, and the NumberWithLRule to traverse the AST and
make its changes.

> So the problem is partitioned effectively into smaller problems by
the treewalker
> for me. This is the greatest benefit from the TreeParser in my opinion.
> It doesn't matter if there are 800 lines of code or less activated
by the
> treewalker. What matters is that the treewalker offers entry points
for specific
> sub-problems. 

My point is that as that 800 lines grows to tens of thousands
of lines, most of the code will start to deal with whole sections
of the tree rather than individual nodes.

> 
> Anyway, I think that the choice of methodology (treewalker or other)
> is up to the developer. It can just be a matter of taste if all
> the available methodologies can offer a solution to the problem.

Yes, definitely.

> 
> What ASPA (and other TreeParser based translators) and jazillian
> prove is that both methodologies are capable to offer a solution.

I'm sure the treewalker approach is well proven and works
fine for most translators. It's just that my goals with
Jazillian bit off more than a treewalker can chew :)

> > 
> > What's the most complex translator that that people
> > have seen? (Complex meaning functionality, not internals).
> I can't tell.
> 
> Anakreon

Thanks again for your time.
Andy

p.s. I assume the Spanish translation of "The White House" is
"The White House" and not "La Casa Blanca" :)

Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/