[antlr-interest] philosophy about translation
Andy Tripp
antlr at jazillian.com
Wed Oct 11 09:48:59 PDT 2006
Terence Parr wrote:
>
> * The nature of translation. Andy refers to the difficulty of
> translating COBOL because scattered pieces of the input need to be
> collected into one location of the output language. This is true and
> COBOL is nasty--perhaps the personification of nasty, so I do not
> want to trivialize this translation problem.
In addition to this "many-to-one" problem, we have "one-to-many"
problems. The single COBOL code:
AGE PIX 9(2) VALUE 3.
...not only becomes a single Java variable declaration:
int age = 3;
...it also contributes to file format information (from that COBOL
statement, we know that this field is
stored in "DISPLAY" format on disk). And this line of code can affect
what we do with other fields that contain it.
> But, Andy made a distinction I believe between a translator and a
> compiler, which I don't think I can agree with. A compiler also
> needs to see declarations and other files and collect pieces from
> many different areas to do a translation. I fail to see the
> distinction.
Then I'm not being clear :( Take me literally when I say prog. lang.
translation can be (if you have the goals that I have) is
like NLP (natural language processing). It's not a "closed, finite"
solution, but rather it's open-ended. Suppose you build
your C to Java translator in the conventional way. You've built an C AST
and transformed it into a Java AST. That took you
let's say a month. If you want realistic "natural" Java code, you're not
even close to done. It's not just trivial things like
changing "!(i == 1)" to "i != 1". It's looking at how real people use
memset() and doing the best you can to replace it.
Doing that for C pointer usage, and doing it well, could be a full-time
job for many years.
I'm not just complaining that the problem is hard. I'm saying that it's
so hard that the nature of the solution changes as it grows.
After that month, you had maybe 90% "tree-walking" code and 10% "other
stuff". After a year or two, you now have
1% "tree-walking" code, and 99% "other stuff". Whether its a C-to-Java
or an English-to-Spanish translator, what started out
as a tree-walking architecture with a fair amount of other stuff, is now
just a lot of other stuff with a little treewalking in there.
Not only is it not a treewalking architecture, but you've spent all your
time figuring out the best way to structure the "other stuff",
and an AST was not the best answer.
Go to someone who's spent years on real-world NLP and suggest that he
first parse the input language into a tree. He'll tell you
that you're missing the whole problem. That there are more exceptions
than rules. He'll challenge you to take 100 lines of
real-world English, get a professional to translate it to Spanish, and
see how well your approach works. I'll throw out the
same challenge. Here is some code that has hand-written C, C#, Java, and
COBOL versions:
http://www2.hursley.ibm.com/decimal/telco.html
Go through a mental treewalk approach on that C code and produce the
Java version. NOT the C version with Java syntax,
not a Java version that works the same as the C version, the ACTUAL Java
code that looks as a person would write it.
It doesn't have to pass a "diff" perfectly, but it should be close
enough that it look hand-written. Then repeat for COBOL to Java.
I have done this, and I'm pretty sure that simply walking the input AST
will get you 1% of the way there.
> Also note that ANTLR has the exact same problem that it must collect
> information from many areas of your grammar so it can determine how
> to generate code. There is a well-established mechanism for building
> data structures such as symbol tables, use-def chains, trees, etc.
> for implementing these translators/compilers. I am finding it hard
> to believe that a symbol table is the wrong approach to resolve
> symbols to their declaration, for example.
I do use symbol tables, but I avoid them. Once you have lots of phases,
you end up spending more effort keeping the symbol table up
to date than it's worth. I want to be able to pretty much just write:
long v[x]; --> int[] v = new int[x];
...without having to touch the symbol table.
> I believe no one will suggest that listing a bunch of rules is the
> right way to build a compiler or ANTLR. Because I cannot see the
> fundamental distinction between those two and the translations done
> by Andy, I cannot agree that for large translations a rule-based
> system is the right approach. Previously I mentioned that I had not
> done as nasty a translator, but in retrospect I must consider ANTLR a
> translator and after 3.5 years of work on that single tool I think I
> have a good understanding of a translation process of similar magnitude.
The difference is that ANTLR's input is a language of your own choosing,
and the output is very much one-to-one with the input.
Imagine if you had to deal with, say, plain BNF input. You'd say "OK,
BNF is limited, so how does anyone specify actions?" and the
answer is some ad-hoc solution. That's how things are: everything in
COBOL is ad-hoc, even C doesn't have features that
you wish it had (classes, exceptions, etc). How do you handle that?
Well, my answer is "do what a person would do." I
look for patterns of code that check library call return values, and map
them to Java exception usage, for example. C (and especially
COBOL) was not designed to be translated to Java, but ANTLR syntax was
explicitly designed to be translated to
ANTLR-generated code.
> [You would not believe the crazy stuff I have to do in the area of
> nonlocal code-generation; when confronted with ID on the right hand
> side of a rewrite rule, code generation for ID is highly context-
> sensitive. Information must be taken from the options area at the
> top of the file, at the start of the rule, and to the left of ->
> symbol but only at the appropriate grammatical level (which requires
> flow analysis and structure information).]
>
> * Ease of specification. Rules are clearly easier than a tree walker
> when they are clean and there are few of them. How can one beat the
> following?
>
> ADD v1 TO v2. --> v2 += v1;
>
> Though with trees, one could perhaps auto translate that text to
> trees. Or, just use raw:
>
> ^(ADD v1 v2) -> ^(PLUS_EQUAL v2 v1)
>
> Anyway, now make it 400 of those rules.
I think it's more like: add 300 of those rules, and 50 rules that are
going to require a fair amount of code (e.g. remove gotos, remove pointers).
And another 50 that are almost simple enough to express as one-liners,
but not quite.
> Now, add a bunch of arbitrary actions that test "am I in this
> function and did I see this variable defined before by walking
> backwards?" in addition to these syntactic rules and I am certain
> that your brain must be very large to figure out the emergent
> behavior of all of these rules.
I still don't think you're seeing the problem as I do. I have a paper -
no longer public, but avail on request - that list most of the
things a person would need to do to change C to Java. That's what needs
to be done. Even if you just look at goto removal, for
example, you'll see it's far more complicated than "what actions should
fire at certain places in the AST?". Or look at
all the handling of preprocessor stuff like "#DEFINE A 1" becoming
"public final static A = 1;".
> This is the standard Prolog programming problem. Andy confirmed for
> me on the phone that debugging why the huge list of rules is not
> giving you the right translation requires tools that generate nice
> HTML reports and lots of thinking sometimes. Further, the constant
> threat of infinite rule-application loops frightens me. Perhaps he
> can give us a better idea of how often he hits a landmine. I may be
> incorrect to focus on this problem.
When something goes wrong - a bug in a rule - it can be a mess to track
back to which rule was really at fault.
The ordering of rules is also tough to get right. And I have hit
problems with infinite loops, but that hasn't been a major heachache.
My approach has its drawbacks, but I couldn't even "get going" with the
AST approach.
>
> One way to look at this problem is: do you want to drive the
> translation based upon the input stream or based upon a set of rules.
Exactly.
> Even if I have to collect information along the way, I preferred to
> drive my translations based upon seeing input constructs. I can step
> through easily with a debugger as it relates directly to the input.
> Driving things from the abstract set of rules, just doesn't feel
> natural to me. But that could be a personal choice.
I think it's all based on the scope of what you're trying to solve.
Implement a goto removal algorithm like this one:
http://citeseer.ist.psu.edu/22521.html
and I think you'll find that the having the ability to fire some action
at a "goto" and a "label" node in an AST is
just insignificant. It's not really a "do the following AST change at
each goto" problem. And
changing a #define to a constant, or a function, or a variable, or just
doing a direct replacement...it's the same thing.
By the time you've written all that code, who cares whether you're in
the middle of walking an AST or
simply matched a token that started with "#define"?
>
> Perhaps one can look at this as ease of specification versus ease of
> debugging and getting it right with all of these rules. Rules are
> easier for smaller systems, but I don't think they scale very well
> (not in functionality of course just in ease of understanding the
> entire system). On the other hand, doing all of the work to get the
> trees and the tree parser together can be a very high transaction
> cost to get started. In the long run though, I am thinking that
> manipulating internal data structures, collecting information you
> need, and then ultimately generating code is the best approach for
> large systems. That is not to say that we should not use the system
> that makes us the most comfortable. Clearly Andy is an exceptional
> developer and has produced some amazing translators. no one can argue
> with Andy's results.
The key is in the problem definition. If you take it as a given that
you'll produce "natural" Java, then you have to decide how
to best do that. The fact that there is literally no other software
other than Jazillian that produces "natural" code, says to me
that everyone who's approached it has taken the AST approach and hit a
dead-end. IBM tried a COBOL-to-Java translator
and failed. How could that be? My guess is that they put a group of
compiler-type people on it and said "make it work", rather than
put a group of NLP people on it, and said "do the best you can".
>
> Personally, I think that a hybrid system that does most of the work
> in a tree walker but with rules, for all of the special cases that
> Andy has pointed out, would be great!
Me too. Or, as we discussed, let me write rules the way I think about
them - as sequences of tokens. But you, internally, make it
into a tree. TXL does that, IIRC, but doesn't go far enough.
>
> * Generality. From Andy's descriptions, he has many hand-built
> functions asking questions about context. As Sohail has pointed out,
> this is the same as building parsers by hand and they are very
> specific to the language being translated.
Not really - a parser builds an AST. It's more like providing a library
to work on the lexer output.
> At this point, I am not sure for what Andy uses ANTLR. Probably to
> build the rule-based system he has.
No, I use it for the lexer, and also I do use ASTs for various
operations on expressions (e.g. what type does
this expression return? Change this expression so it returns a boolean
rather than an int).
>
>
> In summary, I think that a rule-based system has its place and I
> would like to try my hand at building one learning from Andy's
> experience. For myself, I would limit its use to quick and dirty
> translators that I need to build that have less than say 40 rules and
> of course only in cases where I'm not concerned with speed.
I actually think the rule-based approach really shows its value at the
high end of the scale - when you've already written
an AST-walking C-to-Java translator, and now it's time to try to produce
realistic Java code. My code base is constantly
growing as I look and replace new patterns. Just this week, I've added
two new rules to produce more realistic code
when replacing gotos. That's, of course, after a perfectly good
goto-removal rule has been working fine for years.
>
> I'm perfectly willing to be wrong on this, just as I felt others were
> wrong about the LL parsing approach back in the days when LR was king
> in the early 90s.
I appreciate your openness, and I'm not saying I'm always right. Just
trying to express what I've seen.
>
> This is one of the best discussions we have ever had on this mailing
> list a thing. Hooray!
>
> Ter
>
I've enjoyed it too. It's always fun to try to build something that
seems to be impossible.
And even more fun for me to be able to have a good discussion about
unconventional approaches
with people who are a lot smarter and more experienced than I am :)
Andy
More information about the antlr-interest
mailing list