[antlr-interest] philosophy about translation

Tue Oct 10 09:21:39 PDT 2006

Loring,

I tend to think of the issue as one of "what's the best overall approach 
for what you're doing?".
I feel like for what I'm doing, I find a token sequence easier to work 
with than a tree structure.
I did start out using AST transformation, but about 3 months into it, I 
rewrote everything I
had to get rid of the ASTs. I can easily picture "<type> f(<args>) {" as 
a sequence of tokens,
but I can't easily picture it as a tree.

I certainly don't mean to say "no problem needs treewalkers". But I do 
think that as the
size of the task grows, the treewalker/AST approach gets less and less 
useful. I think if
you started out naively writing a natural language translator, you'd do 
a  tree structure.
But after a few months of trying to picture sentences as trees, you'd 
get frustrated.
On day, you'd spend 24 hours straight staring at the sentence "Woods 
Eyes Masters" and realize
it's hopeless.

I feel like that's what I went through with C to Java. To convert a 
typical "printf" call
into it's java equivalent, I'm just going to have to write a bunch of 
code to do that,
and having my input in a tree structure doesn't help me write that code.

I think there's too much "conventional thinking" going on that assumes 
that a language-to-language
(from one high-level language to another, including libraries, and 
producing "natural" code)
translator is similar to a compiler. It's as if we assumed that an 
English-to-Spanish translator
is best designed in the same way as an English-to-MorseCode translator.

I do agree that the tree transformation tools seem to be way too 
limited. TXL came as close
to doing what I needed as an English-to-Spanish dictionary comes close 
to what a UN
interpreter does. Not only that, an AST lets me easily find out what 
function I'm in. That's
as useful as the professional interpreter knowing that "house" is 
"casa"...yes, its necessary, but
if you think it's insignificant. Knowing the "house to casa" mappings is 
maybe 0.01% of the
work. And knowing where you are in the tree is also 0.01% of the work.

So I prefer these steps:
1) Lexing (input to token stream)
2) Search-and-replace (altering the token stream)
3) Repeat step 2, many times
4) Output (pretty-print token stream).

I think my unconventional approach comes from an unconventional starting 
point: the goal
of converting C, C++, COBOL to "natural" Java. Given that goal, the 
problem isn't a matter
of "walk a tree and be sure to do something with every node". It's far 
more complicated than
that. For starters, consider "what do I do at a FUNCTION_CALL node with 
text 'memset'? And
repeat that for each of the standard C library functions.

I agree I have a warped perspective :) Whether any term-rewrite system 
could ever be good
enough to make me reconsider, I don't know. I had looked into TXL and 
SemanticDesign's DMS,
and of course ANTLR. I had your typical 20-year programmer background 
with BS and MS in CS
degrees, experience with Lex and Yacc and similar tools, and I guess a 
health disrespect for
conventional approaches :)

Andy