[antlr-interest] philosophy about translation
Terence Parr
parrt at cs.usfca.edu
Tue Oct 10 13:11:57 PDT 2006
Hi Andy et al.,
Andy and I had a nice chat on the phone the other day and so I have a
much better idea of how his system works. He and I will be meeting
next week and will no doubt have a great conversation about
translation! I am hoping that he will correct many of my
misconceptions. I think I have a lot to learn from him.
In the meantime, I think that there are some high-level issues that
can be raised that might help someone choose one path over the
other. I will try to summarize my thoughts on overall issues here,
and others can correct me.
* Speed. If you need your translator to run in seconds or minutes, not
hours, then most purely declarative, rule-based systems will not be
what you want. For moving legacy code forward, however, there is no
time pressure, so declarative systems work great for legacy code.
I note that this is exactly the area such systems focus on (TXL, DMS,
...). The speed constraint, though, neatly slices off a huge fraction
of applications: those that must operate quickly.
* The nature of translation. Andy refers to the difficulty of
translating COBOL because scattered pieces of the input need to be
collected into one location of the output language. This is true and
COBOL is nasty--perhaps the personification of nasty, so I do not
want to trivialize this translation problem. But Andy made a
distinction, I believe, between a translator and a compiler that I
don't think I can agree with. A compiler also needs to see
declarations and other files and collect pieces from many different
areas to do its translation; I fail to see the distinction. Also
note that ANTLR has the exact same problem: it must collect
information from many areas of your grammar so it can determine how
to generate code. There are well-established mechanisms for building
data structures such as symbol tables, use-def chains, trees, etc.
for implementing these translators/compilers. I find it hard to
believe that a symbol table is the wrong approach to resolving
symbols to their declarations, for example (see the sketch at the end
of this item). I believe no one would suggest that listing a bunch of
rules is the right way to build a compiler or ANTLR itself. Because
I cannot see the fundamental distinction
between those two and the translations done by Andy, I cannot agree
that for large translations a rule-based system is the right
approach. Previously I mentioned that I had not built as nasty a
translator, but in retrospect I must consider ANTLR itself a
translator, and after 3.5 years of work on that single tool I think I
have a good understanding of a translation process of similar
magnitude. [You
would not believe the crazy stuff I have to do in the area of
nonlocal code-generation; when confronted with ID on the right hand
side of a rewrite rule, code generation for ID is highly context-
sensitive. Information must be taken from the options area at the
top of the file, from the start of the rule, and from the left of the
-> symbol, but only at the appropriate grammatical level (which requires
flow analysis and structure information).]
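To make the symbol-table point concrete, here is a minimal sketch of
the kind of scoped symbol table I mean (my own illustration in Java,
not code from ANTLR or from Andy's system):

import java.util.HashMap;
import java.util.Map;

// One nesting level; each scope chains to its enclosing scope.
class Scope {
    final Scope parent;
    final Map<String, String> symbols = new HashMap<String, String>();

    Scope(Scope parent) { this.parent = parent; }

    void define(String name, String declaration) {
        symbols.put(name, declaration);
    }

    // Resolve a name by walking outward through enclosing scopes.
    String resolve(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            String decl = s.symbols.get(name);
            if (decl != null) return decl;
        }
        return null; // undefined symbol
    }
}

class SymtabDemo {
    public static void main(String[] args) {
        Scope globals = new Scope(null);
        globals.define("v1", "global declaration of v1");
        Scope function = new Scope(globals);
        function.define("v2", "local declaration of v2");
        System.out.println(function.resolve("v1")); // found in the global scope
        System.out.println(function.resolve("v2")); // found locally
    }
}

Every compiler builds something like this while it collects scattered
declarations; I see no reason a COBOL translator would not want the
same structure.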
* Ease of specification. Rules are clearly easier than a tree walker
when they are clean and there are few of them. How can one beat the
following?
ADD v1 TO v2. --> v2 += v1;
Though with trees, one could perhaps auto-translate that textual
notation into tree patterns. Or just use the raw tree form:
^(ADD v1 v2) -> ^(PLUS_EQUAL v2 v1)
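For contrast, here is roughly what that single rewrite looks like when
coded by hand in a walker (a sketch against a hypothetical homogeneous
Tree class, not ANTLR's actual runtime API):

import java.util.ArrayList;
import java.util.List;

// Hypothetical homogeneous AST node, roughly in the ANTLR style.
class Tree {
    int type;                 // e.g., ADD or PLUS_EQUAL
    List<Tree> children = new ArrayList<Tree>();
    Tree(int type) { this.type = type; }
}

class AddRewriter {
    static final int ADD = 1, PLUS_EQUAL = 2;

    // ^(ADD v1 v2) -> ^(PLUS_EQUAL v2 v1), done imperatively.
    static Tree rewriteAdd(Tree t) {
        if (t.type != ADD || t.children.size() != 2) return t;
        Tree result = new Tree(PLUS_EQUAL);
        result.children.add(t.children.get(1)); // v2 comes first
        result.children.add(t.children.get(0)); // then v1
        return result;
    }
}

One line of pattern turns into a dozen lines of code, which is exactly
why rules win when they are clean and few.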
Anyway, now make it 400 of those rules. Now add a bunch of
arbitrary actions that test "am I in this function, and did I see this
variable defined before by walking backwards?" in addition to these
syntactic rules, and I am certain that your brain must be very large
to figure out the emergent behavior of all of those rules. This is
the standard Prolog programming problem. Andy confirmed for me on
the phone that debugging why a huge list of rules is not giving you
the right translation sometimes requires tools that generate nice HTML
reports and lots of thinking. Further, the constant threat of
infinite rule-application loops frightens me (see the sketch below).
Perhaps he can give us a better idea of how often he hits a landmine.
I may be incorrect to focus on this problem.
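To illustrate why those loops worry me, here is a sketch of the
fixed-point driver at the heart of any rule engine (my own toy
illustration, not Andy's or anyone else's actual engine), with a crude
pass limit standing in for real loop detection:

import java.util.Collections;
import java.util.List;

// A rewrite rule: returns a transformed result, or its input unchanged.
interface Rule {
    Object apply(Object t);
}

class FixedPointEngine {
    static final int MAX_PASSES = 1000; // crude landmine detector

    // Reapply every rule until nothing changes (a fixed point).
    static Object rewrite(Object t, List<Rule> rules) {
        for (int pass = 0; pass < MAX_PASSES; pass++) {
            boolean changed = false;
            for (Rule r : rules) {
                Object out = r.apply(t);
                if (!out.equals(t)) { t = out; changed = true; }
            }
            if (!changed) return t; // stable; translation is done
        }
        throw new IllegalStateException(
            "no fixed point after " + MAX_PASSES + " passes; rules likely loop");
    }

    public static void main(String[] args) {
        // A toy rule that flips operands forever: x+y <-> y+x.
        Rule flip = new Rule() {
            public Object apply(Object t) {
                return t.equals("x+y") ? "y+x" : "x+y";
            }
        };
        rewrite("x+y", Collections.singletonList(flip)); // throws: never stabilizes
    }
}

Two rules that happen to be each other's inverse, buried somewhere in
a list of 400, produce exactly this behavior, and nothing in the rule
notation itself warns you.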
One way to look at this problem is: do you want to drive the
translation based upon the input stream or based upon a set of
rules? Even if I have to collect information along the way, I
prefer to drive my translations based upon seeing input
constructs. I can step through easily with a debugger because the
code relates directly to the input. Driving things from an abstract
set of rules just doesn't feel natural to me. But that could be a
personal choice.
Perhaps one can look at this as ease of specification versus ease of
debugging and getting it right with all of these rules. Rules are
easier for smaller systems, but I don't think they scale very well
(not in functionality, of course, just in ease of understanding the
entire system). On the other hand, doing all of the work to get the
trees and the tree parser together can be a very high transaction
cost to get started. In the long run though, I am thinking that
manipulating internal data structures, collecting information you
need, and then ultimately generating code is the best approach for
large systems. That is not to say that we should not use the system
that makes us the most comfortable. Clearly Andy is an exceptional
developer and has produced some amazing translators; no one can argue
with his results.
Personally, I think that a hybrid system that does most of the work
in a tree walker, but with rules for all of the special cases that
Andy has pointed out, would be great!
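The shape I have in mind is simple (a sketch, with a hypothetical
SpecialCaseRule interface of my own invention): the walker owns the
traversal and the default translation, and consults a rule list first
at each construct.

import java.util.ArrayList;
import java.util.List;

// A special-case rule: fires only when its guard matches the node.
interface SpecialCaseRule {
    boolean matches(String nodeType, String context);
    String translate(String nodeText);
}

class HybridTranslator {
    final List<SpecialCaseRule> specialCases = new ArrayList<SpecialCaseRule>();

    // Called by the tree walker at each construct it visits: the rules
    // get first crack; the walker's default logic handles the common case.
    String emit(String nodeType, String context, String nodeText) {
        for (SpecialCaseRule r : specialCases) {
            if (r.matches(nodeType, context)) return r.translate(nodeText);
        }
        return defaultTranslation(nodeType, nodeText);
    }

    String defaultTranslation(String nodeType, String nodeText) {
        return nodeText; // placeholder for the walker's ordinary output
    }
}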
* Generality. From Andy's descriptions, he has many hand-built
functions asking questions about context. As Sohail has pointed out,
this is the same as building parsers by hand: they are very
specific to the language being translated. At this point, I am not
sure what Andy uses ANTLR for--probably to build the rule-based
system he has.
In summary, I think that a rule-based system has its place, and I
would like to try my hand at building one, learning from Andy's
experience. For myself, I would limit its use to quick-and-dirty
translators that have fewer than, say, 40 rules--and, of course, only
in cases where I'm not concerned with speed.
I'm perfectly willing to be wrong on this, just as I felt others were
wrong about the LL parsing approach back in the days when LR was king
in the early 90s.
This is one of the best discussions we have ever had on this mailing
list, methinks. Hooray!
Ter