[antlr-interest] Re: Translators Should Use Tree Grammars

Wed Nov 24 11:55:04 PST 2004

Anakreon,
Thanks again for the input. Just a few comments.

> I think the domain is source->source translation.

After all this discussion, I'm starting to think of my thing
as at a bit higher-level than other translators (at least
ASPA). Yes, I do the simple stuff, but I'm mostly doing
the "rewrite this C code in the way a human would rewrite it".
That includes stuff like:
* replace all library calls
* replace return-error-codes with exception handling
* remove all gotos
* replace enum/struct/union with a new class
* replace all pointers (by far the most complicated part of my app)
* restructure code (refactoring) to fit the Java style
* rename files, vars, methods to follow Java conventions
etc.

So I'd like to call my thing a "language rewriter" rather than
"language translator". Where "rewrite" here means "This C
code is a mess, let's rewrite it in Java", not "term rewriting".

> The way of thinking is a matter of idiosynkrasi (I hope
> this is the English word)
Idiosyncratic, I guess.

> Hard problem but application specific. 

Isn't *everything* application specific? I mean If we're translating
from language A to B, and that's all we know so far, 
is there anything that we know needs to be done?
Even using a lexer or parser may not be necessary.

> An import may be required for the translated code in ASPA
> too. In the XML file, a requirement for an import can be defined
> for a class (if any member of the class is called an require 'file'
> statement is generated), for a member (only if the member is called
> the 'require' is generated) or a function. 

I had thought of doing it that way, too, but decided against it.
I didn't want to constantly have to worry about what
"import-needed" classes each rule might need. So I just
have a single rule at the end that has a list of all the
"import-needed" classes (kept in a file that's generated
from the Java source via a doclet - pretty cool),
 and searches for them.

> > I would
> > guess there are a few more situations where I have to
> > do various transformations involving DOT. If so, I'd
> > have to add even more unrelated cases to this one "DOT" place in
> > the code. That's slicing the problem the wrong way.
> How can we discuss the right way of slicing a problem
> in an objective way?

Yea, good point. Now I tend to think that my problem with
treewalkers is just due to my (possibly unique) approach.
As I mentioned elsewhere, perhaps my "translator" is functioning
more like a natural language translator.

> > 
> > 
> >>This is very important because from the grammar definition
> >>I can be sure that only those 3 cases can occur.
> > 
> > 
> > Yea, I can appreciate that. You're sure you've handled
> > every possible input. But on the other hand, you're not
> > at all sure that you've handled all the cases.
> > For example, you surely have one place in the grammar
> > that handles the "+" binary operator. You know you
> > have things covered by covering that one case with, say,
> > a call to a handleBinaryPlus() method. But what does that
> > method do? Does it:
> > * remove redundant zeros (x+0 and 0+x become x)
> > * simplify expressions (x + -1 becomes x - 1)
> No. ASPA is not a code optimizer but a translator.

Right, and mine is a "rewriter" - and a human wouldn't write
"x+0", so I refactor stuff like that. 
(Note that this "x+0" is the example used by Terence
in his guide to treewalkers.

> > * record the fact that each operand is involved in an arithmatic
> >   operation (and thus better not get it's type changed to boolean)
> ASP and PHP(fortunately) are loosely typed languages. But still some
> translating decisions are based on the types of the operands.
> The way to handle this is based on heuristic methods and is not
guarantied
> to be always successful.
> > * combine consecutive string concatenation where possible ("a" + "b" 
> >    becomes just "ab")
> This can be done but I didn't do it. An other thing are expressions
> like
> a = " there"
> Response.write "hello" & a 'prints hello there
> which in PHP can be simplified into:
> print("hello$a")
> I didn't do this because I'm not sure a user of the
> program would think of it as a feature. He could prefer
> to have print("hello" . $a) instead. But it can be easily done.

Yes, I don't do that one either, but I do similar stuff.
I have 19 rules (written in Java - more that are just written
with simple text) witch just do this "normalization"
(changes that do not change functionality, but just clean up
the structure), like add braces around "then" and "else" clauses,
combine statements where appropriate, remove commas within
"for" clauses, etc.

.
> 
> > Isn't it more natural to have a separate rule for each of the
> > above items? That way, 1) we avoid having this handleBinaryPlus()
> > method performing 4 completely unrelated functions, and 2)
> > we avoid having the "change x from int to boolean" logic
> > split across handleBinaryPlus() and other functions.
> A matter of taste I'm afraid.

It's also a matter of degree, I think.
I suspect that most of the actions that fire within your grammar
only have a single "rule to apply" or at most two or three.
For me, just the fact that I have hundreds of rules and only
about 70 non-terminals in the grammar, means that I have
at least a few "rules" that apply at each grammar node.

> 
> > I just don't see the advantage of this "fire a rule at each
> > node" approach. As I look through my rules, almost none
> > of them involve a single node in the tree.
> Example:
> a = "one" + true + "two" + new Date() + "three" + 5
> (EXPR (ASSIGN a (PLUS (PLUS (PLUS (PLUS (PLUS one true) two) (NEW
DATE ELIST)) three) 5)
> The translated one:
> (EXPR (ASSIGN a (CONCAT (CONCAT (CONCAT (CONCAT (CONCAT one true)
two) [METHOD_CALL, getdate]) three) 5)
> Except for the deepest PLUS, there are single node operands involved.
> But each time PLUS rule is called it only cares about the [guessed] type
> of it's operands. 

Ahhh...good example! To translate that line to Java, you've
got to first convert each of the non-string operands to
a String. To do that, you've got to know something about
what "new Date()" evaluates to, as well as what the other
operands evaluate to. You would need to change "new Date()"
to something, perhaps quite different, but ONLY do that
change because the "new Date()" appears inside an expression
involving "+" where the expresion will be String type. You cannot
just blindly convert "new Date()" to something without
knowing the context (maybe you can if you're just trying to
produce code that works, but not if you're trying to
write code that looks hand-written).

> > And even when 
> > rules do involve a single node, I don't want to mix
> > them together. For example, one rule removes the "u"
> > in "123u" (java doesn't have unsigned types). And another
> > removes the "L" in "123L" (because a C long is [usually]
> > an Java int). Yes, I can have one handleNumber() method
> > fire at the NUMBER node that does both of these. But
> > I'd rather not slice the problem that way. Instead, I'd
> > like the NumberWithURule to traverse the AST and make its
> > changes, and the NumberWithLRule to traverse the AST and
> > make its changes.
> ASPA is single pass, but you could do those transformations
> mentioned above in many passes.

And how many passes (and associated tree-grammars) do I have to
have before it becomes a total mess? 10? 50?
I really don't want to keep track of all those rules each
doing multiple passes.
(say the "new Date()" conversion rule relies on the first
pass to find the "new Date()" pattern, another pass to
check the local context, a third pass to add some
"import" that the Date() class might need)

Andy

Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/