[antlr-interest] philosophy about translation
Terence Parr
parrt at cs.usfca.edu
Tue Oct 10 13:11:57 PDT 2006
Hi Andy et al.,
Andy and I had a nice chat on the phone the other day and so I have a
much better idea of how his system works. He and I will be meeting
next week and will no doubt have a great conversation about
translation! I am hoping that he will correct many of my
misconceptions. I think I have a lot to learn from him.
In the meantime, I think that there are some high-level issues that
can be raised that might help someone choose one path over the
other. I will try to summarize my thoughts on overall issues here,
and others can correct me.
* Speed. If you need your translator to run in seconds or minutes, not
hours, then most purely declarative, rule-based systems will not be
what you want. For moving legacy code forward, however, there is no
time pressure, so declarative systems work great for legacy code.
I note that this is exactly the area such systems focus on (TXL, DMS,
...). The speed constraint, though, neatly slices off a huge fraction
of applications: those that must operate quickly.
* The nature of translation. Andy refers to the difficulty of
translating COBOL because scattered pieces of the input need to be
collected into one location of the output language. This is true and
COBOL is nasty--perhaps the personification of nasty, so I do not
want to trivialize this translation problem. But Andy made a
distinction, I believe, between a translator and a compiler that I
don't think I can agree with. A compiler also needs to see
declarations and other files and collect pieces from many different
areas to do its translation; I fail to see the distinction. Also
note that ANTLR has the exact same problem: it must collect
information from many areas of your grammar so it can determine how
to generate code. There are well-established mechanisms for building
data structures such as symbol tables, use-def chains, trees, etc.
for implementing these translators/compilers. I find it hard to
believe that a symbol table is the wrong approach to resolving
symbols to their declarations, for example (see the sketch at the end
of this item). I believe no one would suggest that listing a bunch of
rules is the right way to build a compiler or ANTLR itself. Because
I cannot see the fundamental distinction
between those two and the translations done by Andy, I cannot agree
that for large translations a rule-based system is the right
approach. Previously I mentioned that I had not built as nasty a
translator, but in retrospect I must consider ANTLR itself a
translator, and after 3.5 years of work on that single tool I think I
have a good understanding of a translation process of similar
magnitude. [You
would not believe the crazy stuff I have to do in the area of
nonlocal code-generation; when confronted with ID on the right hand
side of a rewrite rule, code generation for ID is highly context-
sensitive. Information must be taken from the options area at the
top of the file, from the start of the rule, and from the left of the
-> symbol, but only at the appropriate grammatical level (which requires
flow analysis and structure information).]
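To make the symbol-table point concrete, here is a minimal sketch of
the kind of scoped symbol table I mean (my own illustration in Java,
not code from ANTLR or from Andy's system):

import java.util.HashMap;
import java.util.Map;

// One nesting level; each scope chains to its enclosing scope.
class Scope {
    final Scope parent;
    final Map<String, String> symbols = new HashMap<String, String>();

    Scope(Scope parent) { this.parent = parent; }

    void define(String name, String declaration) {
        symbols.put(name, declaration);
    }

    // Resolve a name by walking outward through enclosing scopes.
    String resolve(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            String decl = s.symbols.get(name);
            if (decl != null) return decl;
        }
        return null; // undefined symbol
    }
}

class SymtabDemo {
    public static void main(String[] args) {
        Scope globals = new Scope(null);
        globals.define("v1", "global declaration of v1");
        Scope function = new Scope(globals);
        function.define("v2", "local declaration of v2");
        System.out.println(function.resolve("v1")); // found in the global scope
        System.out.println(function.resolve("v2")); // found locally
    }
}

Every compiler builds something like this while it collects scattered
declarations; I see no reason a COBOL translator would not want the
same structure.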
* Ease of specification. Rules are clearly easier than a tree walker
when they are clean and there are few of them. How can one beat the
following?
ADD v1 TO v2. --> v2 += v1;
Though with trees, one could perhaps auto-translate that textual
notation into tree patterns. Or just use the raw tree form:
^(ADD v1 v2) -> ^(PLUS_EQUAL v2 v1)
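For contrast, here is roughly what that single rewrite looks like when
coded by hand in a walker (a sketch against a hypothetical homogeneous
Tree class, not ANTLR's actual runtime API):

import java.util.ArrayList;
import java.util.List;

// Hypothetical homogeneous AST node, roughly in the ANTLR style.
class Tree {
    int type;                 // e.g., ADD or PLUS_EQUAL
    List<Tree> children = new ArrayList<Tree>();
    Tree(int type) { this.type = type; }
}

class AddRewriter {
    static final int ADD = 1, PLUS_EQUAL = 2;

    // ^(ADD v1 v2) -> ^(PLUS_EQUAL v2 v1), done imperatively.
    static Tree rewriteAdd(Tree t) {
        if (t.type != ADD || t.children.size() != 2) return t;
        Tree result = new Tree(PLUS_EQUAL);
        result.children.add(t.children.get(1)); // v2 comes first
        result.children.add(t.children.get(0)); // then v1
        return result;
    }
}

One line of pattern turns into a dozen lines of code, which is exactly
why rules win when they are clean and few.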
Anyway, now make it 400 of those rules. Now add a bunch of
arbitrary actions that test "am I in this function, and did I see this
variable defined before by walking backwards?" in addition to these
syntactic rules, and I am certain that your brain must be very large
to figure out the emergent behavior of all of those rules. This is
the standard Prolog programming problem. Andy confirmed for me on
the phone that debugging why a huge list of rules is not giving you
the right translation sometimes requires tools that generate nice HTML
reports and lots of thinking. Further, the constant threat of
infinite rule-application loops frightens me (see the sketch below).
Perhaps he can give us a better idea of how often he hits a landmine.
I may be incorrect to focus on this problem.
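To illustrate why those loops worry me, here is a sketch of the
fixed-point driver at the heart of any rule engine (my own toy
illustration, not Andy's or anyone else's actual engine), with a crude
pass limit standing in for real loop detection:

import java.util.Collections;
import java.util.List;

// A rewrite rule: returns a transformed result, or its input unchanged.
interface Rule {
    Object apply(Object t);
}

class FixedPointEngine {
    static final int MAX_PASSES = 1000; // crude landmine detector

    // Reapply every rule until nothing changes (a fixed point).
    static Object rewrite(Object t, List<Rule> rules) {
        for (int pass = 0; pass < MAX_PASSES; pass++) {
            boolean changed = false;
            for (Rule r : rules) {
                Object out = r.apply(t);
                if (!out.equals(t)) { t = out; changed = true; }
            }
            if (!changed) return t; // stable; translation is done
        }
        throw new IllegalStateException(
            "no fixed point after " + MAX_PASSES + " passes; rules likely loop");
    }

    public static void main(String[] args) {
        // A toy rule that flips operands forever: x+y <-> y+x.
        Rule flip = new Rule() {
            public Object apply(Object t) {
                return t.equals("x+y") ? "y+x" : "x+y";
            }
        };
        rewrite("x+y", Collections.singletonList(flip)); // throws: never stabilizes
    }
}

Two rules that happen to be each other's inverse, buried somewhere in
a list of 400, produce exactly this behavior, and nothing in the rule
notation itself warns you.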
One way to look at this problem is: do you want to drive the
translation based upon the input stream or based upon a set of
rules? Even if I have to collect information along the way, I
prefer to drive my translations based upon seeing input
constructs. I can step through easily with a debugger because the
code relates directly to the input. Driving things from an abstract
set of rules just doesn't feel natural to me. But that could be a
personal choice.
Perhaps one can look at this as ease of specification versus ease of
debugging and getting it right with all of these rules. Rules are
easier for smaller systems, but I don't think they scale very well
(not in functionality, of course, just in ease of understanding the
entire system). On the other hand, doing all of the work to get the
trees and the tree parser together can be a very high transaction
cost to get started. In the long run though, I am thinking that
manipulating internal data structures, collecting information you
need, and then ultimately generating code is the best approach for
large systems. That is not to say that we should not use the system
that makes us the most comfortable. Clearly Andy is an exceptional
developer and has produced some amazing translators; no one can argue
with his results.
Personally, I think that a hybrid system that does most of the work
in a tree walker, but with rules for all of the special cases that
Andy has pointed out, would be great!
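The shape I have in mind is simple (a sketch, with a hypothetical
SpecialCaseRule interface of my own invention): the walker owns the
traversal and the default translation, and consults a rule list first
at each construct.

import java.util.ArrayList;
import java.util.List;

// A special-case rule: fires only when its guard matches the node.
interface SpecialCaseRule {
    boolean matches(String nodeType, String context);
    String translate(String nodeText);
}

class HybridTranslator {
    final List<SpecialCaseRule> specialCases = new ArrayList<SpecialCaseRule>();

    // Called by the tree walker at each construct it visits: the rules
    // get first crack; the walker's default logic handles the common case.
    String emit(String nodeType, String context, String nodeText) {
        for (SpecialCaseRule r : specialCases) {
            if (r.matches(nodeType, context)) return r.translate(nodeText);
        }
        return defaultTranslation(nodeType, nodeText);
    }

    String defaultTranslation(String nodeType, String nodeText) {
        return nodeText; // placeholder for the walker's ordinary output
    }
}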
* Generality. From Andy's descriptions, he has many hand-built
functions asking questions about context. As Sohail has pointed out,
this is the same as building parsers by hand: they are very
specific to the language being translated. At this point, I am not
sure what Andy uses ANTLR for--probably to build the rule-based
system he has.
In summary, I think that a rule-based system has its place, and I
would like to try my hand at building one, learning from Andy's
experience. For myself, I would limit its use to quick-and-dirty
translators that have fewer than, say, 40 rules--and, of course, only
in cases where I'm not concerned with speed.
I'm perfectly willing to be wrong on this, just as I felt others were
wrong about the LL parsing approach back in the days when LR was king
in the early 90s.
This is one of the best discussions we have ever had on this mailing
list, methinks. Hooray!
Ter