[antlr-interest] philosophy about translation

Wed Oct 11 09:48:59 PDT 2006

Terence Parr wrote:

>
> * The nature of translation.  Andy refers to the difficulty of  
> translating COBOL because scattered pieces of the input need to be  
> collected into one location of the output language.  This is true and  
> COBOL is nasty--perhaps the personification of nasty, so I do not  
> want to trivialize this translation problem.

In addition to this "many-to-one" problem, we have "one-to-many" 
problems. The single COBOL code:
AGE PIX 9(2) VALUE 3.
...not only becomes a single Java variable declaration:
int age = 3;
...it also contributes to file format information (from that COBOL 
statement, we know that this field is
stored in "DISPLAY" format on disk). And this line of code can affect 
what we do with other fields that contain it.

> But, Andy made a  distinction I believe between a translator and a 
> compiler, which I  don't think I can agree with.  A compiler also 
> needs to see  declarations and other files and collect pieces from 
> many different  areas to do a translation.  I fail to see the 
> distinction.   

Then I'm not being clear :(  Take me literally when I say prog. lang. 
translation can be (if you have the goals that I have) is
like NLP (natural language processing). It's not a "closed, finite" 
solution, but rather it's open-ended. Suppose you build
your C to Java translator in the conventional way. You've built an C AST 
and transformed it into a Java AST. That took you
let's say a month. If you want realistic "natural" Java code, you're not 
even close to done. It's not just trivial things like
changing "!(i == 1)" to "i != 1". It's looking at how real people use 
memset() and doing the best you can to replace it.
Doing that for C pointer usage, and doing it well, could be a full-time 
job for many years.

I'm not just complaining that the problem is hard. I'm saying that it's 
so hard that the nature of the solution changes as it grows.
After that month, you had maybe 90% "tree-walking" code and 10% "other 
stuff". After a year or two, you now have
1% "tree-walking" code, and 99% "other stuff". Whether its a C-to-Java 
or an English-to-Spanish translator, what started out
as a tree-walking architecture with a fair amount of other stuff, is now 
just a lot of other stuff with a little treewalking in there.
Not only is it not a treewalking architecture, but you've spent all your 
time figuring out the best way to structure the "other stuff",
and an AST was not the best answer.

Go to someone who's spent years on real-world NLP and suggest that he 
first parse the input language into a tree. He'll tell you
that you're missing the whole problem. That there are more exceptions 
than rules. He'll challenge you to take 100 lines of
real-world English, get a professional to translate it to Spanish, and 
see how well your approach works. I'll throw out the
same challenge. Here is some code that has hand-written C, C#, Java, and 
COBOL versions:
http://www2.hursley.ibm.com/decimal/telco.html
Go through a mental treewalk approach on that C code and produce the 
Java version. NOT the C version with Java syntax,
not a Java version that works the same as the C version, the ACTUAL Java 
code that looks as a person would write it.
It doesn't have to pass a "diff" perfectly, but it should be close 
enough that it look hand-written. Then repeat for COBOL to Java.
I have done this, and I'm pretty sure that simply walking the input AST 
will get you 1% of the way there.

> Also  note that ANTLR has the exact same problem that it must collect  
> information from many areas of your grammar so it can determine how  
> to generate code.  There is a well-established mechanism for building  
> data structures such as symbol tables, use-def chains, trees,  etc.  
> for implementing these translators/compilers.   I am finding it hard  
> to believe that a symbol table is the wrong approach to resolve  
> symbols to their declaration, for example.  

I do use symbol tables, but I avoid them. Once you have lots of phases, 
you end up spending more effort keeping the symbol table up
to date than it's worth. I want to be able to pretty much just write:
long v[x];  --> int[] v = new int[x];
...without having to touch the symbol table.

> I believe no one will  suggest that listing a bunch of rules is the 
> right way to build a  compiler or ANTLR.  Because I cannot see the 
> fundamental distinction  between those two and the translations done 
> by Andy, I cannot agree  that for large translations a rule-based 
> system is the right  approach.  Previously I mentioned that I had not 
> done as nasty a  translator, but in retrospect I must consider ANTLR a 
> translator and  after 3.5 years of work on that single tool I think I 
> have a good  understanding of a translation process of similar magnitude.

The difference is that ANTLR's input is a language of your own choosing, 
and the output is very much one-to-one with the input.
Imagine if you had to deal with, say, plain BNF input. You'd say "OK, 
BNF is limited, so how does anyone specify actions?" and the
answer is some ad-hoc solution. That's how things are: everything in 
COBOL is ad-hoc, even C doesn't have features that
you wish it had (classes, exceptions, etc). How do you handle that? 
Well, my answer is "do what a person would do." I
look for patterns of code that check library call return values, and map 
them to Java exception usage, for example. C (and especially
COBOL) was not designed to be translated to Java, but ANTLR syntax was 
explicitly designed to be translated to
ANTLR-generated code.

>   [You  would not believe the crazy stuff I have to do in the area of  
> nonlocal code-generation; when confronted with ID on the right hand  
> side of a rewrite rule, code generation for ID is highly context- 
> sensitive.  Information must be taken from the options area at the  
> top of the file, at the start of the rule, and to the left of ->  
> symbol but only at the appropriate grammatical level (which requires  
> flow analysis and structure information).]
>
> * Ease of specification.  Rules are clearly easier than a tree walker  
> when they are clean and there are few of them.  How can one beat the  
> following?
>
> ADD v1 TO v2. --> v2 += v1;
>
> Though with trees, one could perhaps auto translate that text to  
> trees.   Or, just use raw:
>
> ^(ADD v1 v2) -> ^(PLUS_EQUAL v2 v1)
>
> Anyway, now make it 400 of those rules.  

I think it's more like: add 300 of those rules, and 50 rules that are
going to require a fair amount of code (e.g. remove gotos, remove pointers).
And another 50 that are almost simple enough to express as one-liners, 
but not quite.

> Now, add a bunch of  arbitrary actions that test "am I in this 
> function and did I see this  variable defined before by walking 
> backwards?" in addition to these  syntactic rules and I am certain 
> that your brain must be very large  to figure out the emergent 
> behavior of all of these rules.   

I still don't think you're seeing the problem as I do. I have a paper - 
no longer public, but avail on request - that list most of the
things a person would need to do to change C to Java. That's what needs 
to be done. Even if you just look at goto removal, for
example, you'll see it's far more complicated than "what actions should 
fire at certain places in the AST?". Or look at
all the handling of preprocessor stuff like "#DEFINE A 1" becoming 
"public final static A = 1;".

> This is  the standard Prolog programming problem.  Andy confirmed for 
> me on  the phone that debugging why the huge list of rules is not 
> giving you  the right translation requires tools that generate nice 
> HTML reports  and lots of thinking sometimes.  Further, the constant 
> threat of  infinite rule-application loops frightens me.    Perhaps he 
> can give  us a better idea of how often he hits a landmine.  I may be 
> incorrect  to focus on this problem.

When something goes wrong - a bug in a rule - it can be a mess to track 
back to which rule was really at fault.
The ordering of rules is also tough to get right. And I have hit 
problems with infinite loops, but that hasn't been a major heachache.
My approach has its drawbacks, but I couldn't even "get going" with the 
AST approach.

>
> One way to look at this problem is: do you want to drive the  
> translation based upon the input stream or based upon a set of  rules.  

Exactly.

> Even if I have to collect information along the way, I  preferred to 
> drive my translations based upon seeing input  constructs.  I can step 
> through easily with a debugger as it relates  directly to the input.  
> Driving things from the abstract set of  rules, just doesn't feel 
> natural to me.  But that could be a personal  choice.

I think it's all based on the scope of what you're trying to solve.
Implement a goto removal algorithm like this one: 
http://citeseer.ist.psu.edu/22521.html
and I think you'll find that the having the ability to fire some action 
at a "goto" and a "label" node in an AST is
just insignificant. It's not really a "do the following AST change at 
each goto" problem. And
changing a #define to a constant, or a function, or a variable, or just 
doing a direct replacement...it's the same thing.
By the time you've written all that code, who cares whether you're in 
the middle of walking an AST or
simply matched a token that started with "#define"?

>
> Perhaps one can look at this as ease of specification versus ease of  
> debugging and getting it right with all of these rules.   Rules are  
> easier for smaller systems, but I don't think they scale very well  
> (not in functionality of course just in ease of understanding the  
> entire system).  On the other hand, doing all of the work to get the  
> trees and the tree parser together can be a very high transaction  
> cost to get started.  In the long run though, I am thinking that  
> manipulating internal data structures, collecting information you  
> need, and then ultimately generating code is the best approach for  
> large systems.  That is not to say that we should not use the system  
> that makes us the most comfortable.  Clearly Andy is an exceptional  
> developer and has produced some amazing translators. no one can argue  
> with Andy's results.

The key is in the problem definition. If you take it as a given that 
you'll produce "natural" Java, then you have to decide how
to best do that. The fact that there is literally no other software 
other than Jazillian that produces "natural" code, says to me
that everyone who's approached it has taken the AST approach and hit a 
dead-end. IBM tried a COBOL-to-Java translator
and failed. How could that be? My guess is that they put a group of 
compiler-type people on it and said "make it work", rather than
put a group of NLP people on it, and said "do the best you can".

>
> Personally,  I think that a hybrid system that does most of the work  
> in a tree walker but with rules, for all of the special cases that  
> Andy has pointed out, would be great!

Me too. Or, as we discussed, let me write rules the way I think about 
them - as sequences of tokens. But you, internally, make it
into a tree. TXL does that, IIRC, but doesn't go far enough.

>
> * Generality.  From Andy's descriptions, he has many hand-built  
> functions asking questions about context.  As Sohail has pointed out,  
> this is the same as building parsers by hand and they are very  
> specific to the language being translated.

Not really - a parser builds an AST. It's more like providing a library 
to work on the lexer output.

> At this point, I am not  sure for what Andy uses ANTLR.  Probably to 
> build the rule-based  system he has.

No, I use it for the lexer, and also I do use ASTs for various 
operations on expressions (e.g. what type does
this expression return? Change this expression so it returns a boolean 
rather than an int).

>
>
> In summary, I think that a rule-based system has its place and I  
> would like to try my hand at building one learning from Andy's  
> experience.   For myself, I would limit its use to quick and dirty  
> translators that I need to build that have less than say 40 rules and  
> of course only in cases where I'm not concerned with speed.

I actually think the rule-based approach really shows its value at the 
high end of the scale - when you've already written
an AST-walking C-to-Java translator, and now it's time to try to produce 
realistic Java code. My code base is constantly
growing as I look and replace new patterns. Just this week, I've added 
two new rules to produce more realistic code
when replacing gotos. That's, of course, after a perfectly good 
goto-removal rule has been working fine for years.

>
> I'm perfectly willing to be wrong on this, just as I felt others were  
> wrong about the LL parsing approach back in the days when LR was king  
> in the early 90s.

I appreciate your openness, and I'm not saying I'm always right. Just 
trying to express what I've seen.

>
> This is one of the best discussions we have ever had on this mailing  
> list a thing.  Hooray!
>
> Ter
>
I've enjoyed it too. It's always fun to try to build something that 
seems to be impossible.
And even more fun for me to be able to have a good discussion about 
unconventional approaches
with people who are a lot smarter and more experienced than I am :)

Andy