[antlr-interest] philosophy about translation

Tue Oct 10 19:32:10 PDT 2006

Andy--

I deliberately chose your message to respond to because it captured the pragmatic viewpoint--"I had a problem; I solved it; I did not need to use trees"--quite effectively.  That is quite a bit different from the more usual "I tried to use ANTLR's tree facilities and discovered that writing them by hand and using a visitor is easy and clearly is the ONE TRUE SOLUTION to language processing" that appears in the group every so often.  (Although, truth be told, some of your early messages to the group were along that vein, but you have since worked on problems that require multi-pass recognition support and have broadened your horizons.)

"Thinking in trees" does not come automatically.  It is like learning LISP or Forth or one of the functional languages (and, for that matter, object-oriented programming:  there is a lot of badly designed and implemented C++ code out there).  For a time, working with trees is like slogging through molasses, and then you get the "Aha!" experience and things become easy.  It usually is not about designing the perfect tree structure; instead, it is about simplifying the recognition problem and expressing target language constructs in tree form.

If I were working on a natural language problem, would I use trees?  Sure!  Trees are a very convenient way of capturing semantics into syntactic form.  Would I generate output with a visitor?  For Esperanto or one of the Romance languages, possibly (for these languages, simple syntactic principles apply); for English, that is unlikely--too much "peephole" substitution is necessary to handle special cases.  Instead, I would most likely process the input into a canonical tree with a regular structure (converting input to tree form is about transforming the special cases) and do some form of rule-based substitution for all of the nasty colloquialisms and other special cases.  (In compiler parlance, that is called "peephole optimization", often an essential for generating good code.)

As to compilers and compiler technology:  you might be surprised.  "Compiler technology" and "language translation technology" are effectively synonyms, and the technical term for the translators you have discussed in this group is "cascading compiler" (or used to be; the term has pretty much disappeared from the literature as compiler technology research has declined in popularity).  Most language translation problems are much simpler than a conventional compiler and are solved using a subset of the algorithms that go into building an optimizing compiler.  Nothing that you have described doing is out of place in a conventional compiler environment.  You might find it interesting to take a look at GNU RTL, used to generate the peephole optimizer for the gcc toolset.

As to your "unconventional" approach:  I hate to say this, but everything that you have described doing is well documented in the literature.  I have noticed that "healthy disrespect for conventional approaches" usually translates to "I never check to see if I am reinventing the wheel or not".  One of the great resources on the web is Citeseer:  <http://citeseer.ist.psu.edu/>.  It has some of the good early papers on pattern matching and transformation systems.

Cheers!

--Loring

Andy Tripp <antlr at jazillian.com> wrote: Loring,

I tend to think of the issue as one of "what's the best overall approach 
for what you're doing?".
I feel like for what I'm doing, I find a token sequence easier to work 
with than a tree structure.
I did start out using AST transformation, but about 3 months into it, I 
rewrote everything I
had to get rid of the ASTs. I can easily picture " f() {" as 
a sequence of tokens,
but I can't easily picture it as a tree.

I certainly don't mean to say "no problem needs treewalkers". But I do 
think that as the
size of the task grows, the treewalker/AST approach gets less and less 
useful. I think if
you started out naively writing a natural language translator, you'd do 
a  tree structure.
But after a few months of trying to picture sentences as trees, you'd 
get frustrated.
On day, you'd spend 24 hours straight staring at the sentence "Woods 
Eyes Masters" and realize
it's hopeless.

I feel like that's what I went through with C to Java. To convert a 
typical "printf" call
into it's java equivalent, I'm just going to have to write a bunch of 
code to do that,
and having my input in a tree structure doesn't help me write that code.

I think there's too much "conventional thinking" going on that assumes 
that a language-to-language
(from one high-level language to another, including libraries, and 
producing "natural" code)
translator is similar to a compiler. It's as if we assumed that an 
English-to-Spanish translator
is best designed in the same way as an English-to-MorseCode translator.

I do agree that the tree transformation tools seem to be way too 
limited. TXL came as close
to doing what I needed as an English-to-Spanish dictionary comes close 
to what a UN
interpreter does. Not only that, an AST lets me easily find out what 
function I'm in. That's
as useful as the professional interpreter knowing that "house" is 
"casa"...yes, its necessary, but
if you think it's insignificant. Knowing the "house to casa" mappings is 
maybe 0.01% of the
work. And knowing where you are in the tree is also 0.01% of the work.

So I prefer these steps:
1) Lexing (input to token stream)
2) Search-and-replace (altering the token stream)
3) Repeat step 2, many times
4) Output (pretty-print token stream).

I think my unconventional approach comes from an unconventional starting 
point: the goal
of converting C, C++, COBOL to "natural" Java. Given that goal, the 
problem isn't a matter
of "walk a tree and be sure to do something with every node". It's far 
more complicated than
that. For starters, consider "what do I do at a FUNCTION_CALL node with 
text 'memset'? And
repeat that for each of the standard C library functions.

I agree I have a warped perspective :) Whether any term-rewrite system 
could ever be good
enough to make me reconsider, I don't know. I had looked into TXL and 
SemanticDesign's DMS,
and of course ANTLR. I had your typical 20-year programmer background 
with BS and MS in CS
degrees, experience with Lex and Yacc and similar tools, and I guess a 
health disrespect for
conventional approaches :)

Andy

---------------------------------
Stay in the know. Pulse on the new Yahoo.com.  Check it out. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20061010/d48dc385/attachment.html