[antlr-interest] philosophy about translation

Fri Oct 27 16:12:59 PDT 2006

Jim Idle wrote:

>>I disagree. With ANTLR treewalkers, or with any other tool that isn't
>>a treewalker, when you build ASTs and then transform them to other ASTs,
>>you have to be intimately familiar with the shape of those ASTs (i.e.
>>the grammar for the input and output languages). I'd rather not have
>>to know that.
>>    
>>
>
>I see no way to avoid this and produce a good result. 
>
I feel that I'm getting a "good result" for C, C++, and COBOL to Java
translation without being intimately familiar with their AST structures.
If you can point to a potential problem spot, I can explain how I handle it.

>However there are few languages such that being familiar with the type of tree that one language produces does not help with the tree that another produces. In fact I think that that TreeParser grammar is a huge aid to being able to 'read' the tree. 
>  
>
I'm sure that knowing the C tree structure helps a lot with C++. But even
comparing the C structure with the Java one, there are quite a few
differences for no real reason other than the whim of the person who wrote
each. That really shows when you compare the two perfectly good, but
different, java.g files at antlr.org.

>>I know that the COBOL sentence:
>>ADD 1 TO A GIVING B.
>>...maps to the Java statement...
>>B = A + 1;
>>    
>>
>
>  
>
>>...and yet I have little clue as to what the COBOL or Java ASTs look like.
>>So I really do want to write:
>>ADD v1 TO v2 GIVING v3 --> v3 = v1 + v2; 
>>    
>>
>
>Taking your COBOL example though, I think that the issue of translating one language to another is much more complex in general than this and that the issue would be being intimately familiar with the languages, the tree surely being a relatively easy thing to pick up? 
>
Not for me. I've been programming in C for 25 years, and yet I didn't know
what the "a.add(1)" tree looks like until I tried it. Sure, I could have
guessed and gotten close, but close counts for nothing in this context.

>What is the PIC of A and B for instance, where is the meta data about this to be stored (front end, encoded in IR, back end?), what significance does this have on the target language? What is the behavior of the VM when you produce System.out.println("String " + A); // What happens internally with A, will I produce code that cause STR->INT->STR conversion all the time. COBOL will reject things that don't fit the PIC... etc.
>  
>
Right. Those are all things I deal with, and they're all difficult. So
that's where I want to spend my time, not on worrying about AST details.

>What happens with:
>
>MOVE MOUNTAIN TO MOHAMMED;
>
>A universal front end->IR->high level language methodology is probably not possible. 
>  
>
Right. Like replacing a C memset() call, it's probably not possible in
general. Yet I believe it is possible in practice. I know it would take a
lot of work to convince you of that.

>Surely the rule matching scenario would be able to formulate an unknown sequence of events such that ruletriggerA changes some part of the input which fires ruletriggerB, which changes some part of the input that fires ruletriggerA... 
>  
>
I don't have "triggers" firing rules, but rule firing order is very
important. And some rules might just keep firing until they stop making
changes.
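
To make the "keep firing until they stop making changes" idea concrete,
here is a minimal Java sketch. It is not Jazillian's code: the class name
FixpointRules, the DROP_DOUBLE_PARENS rule, and the maxPasses cap are all
made up for illustration, and rules are assumed to be plain string-to-string
functions. The cap is one possible guard against the kind of loop Jim
describes, where one rule keeps undoing another.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not Jazillian's implementation.
class FixpointRules {
    // Illustrative rule: collapse one level of doubled parentheses, "((e))" -> "(e)".
    static final UnaryOperator<String> DROP_DOUBLE_PARENS =
            s -> s.replaceAll("\\(\\(([^()]*)\\)\\)", "($1)");

    // Re-applies a single rule until a pass makes no change, giving up after maxPasses.
    static String applyUntilStable(UnaryOperator<String> rule, String text, int maxPasses) {
        for (int pass = 0; pass < maxPasses; pass++) {
            String next = rule.apply(text);
            if (next.equals(text)) {
                return text; // the rule has stopped making changes
            }
            text = next;
        }
        throw new IllegalStateException("rule did not settle after " + maxPasses + " passes");
    }

    // Fires the rules in list order; each one runs to a fixed point before the next starts.
    static String applyInOrder(List<UnaryOperator<String>> rules, String text, int maxPasses) {
        for (UnaryOperator<String> rule : rules) {
            text = applyUntilStable(rule, text, maxPasses);
        }
        return text;
    }

    public static void main(String[] args) {
        // Takes two changing passes plus one confirming pass.
        System.out.println(applyUntilStable(DROP_DOUBLE_PARENS, "x = (((a + b)));", 10));
        // prints: x = (a + b);
    }
}
```

A rule that never settles (say, one that appends a character on every pass)
hits the cap and throws instead of looping forever.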

>It would seem that one has a specific project "Source code for app A1 in lang L1 translated to A1 in lang L2", or "Any App AN in L1 to L2" or "Lang L1 to Lang L2" or "LN1 to LN2; N1 # N2" and so on. I will ignore A1->A2 ;-). 
>
>The amount of support library programming in lang L2 would probably far outweigh other issues and I think that assuming you can find good enough programmers (big if though I admit) that just rewriting it in L2 would be better anyway. There is probably no way to avoid the new source code looking like the input source code and that a programmer of LANG L2 would say "What the bejesus is this?"
>  
>
If you do lots of intelligent replacement, then the "support library" can
be very small. That's the case with Jazillian. And if you don't do
intelligent replacement, it's impossible anyway: you're never going to
write memset() in Java no matter how great your programmers. That's why we
have just two classes of language translators today: the kind that produce
correct, horribly convoluted code (like Ephedra for C to Java), and the
kind that produce not-guaranteed-correct, nice, readable code. Jazillian is
the only translator in that second category (except maybe for this one:
http://reinventsoft.com/intentionalcompilation.html).

>For a translation solution, I suspect then that you just "type it in" and end up with a tool specific to the thing you want to translate, starting with tree walkers then probably some manual hard coded passes. Of course, you could consider this rule set approach part of the latter phase with a more specific task at hand. I think that this yields a practical solution to the task in hand and that you could knock out 10 of these in the time taken to deal with more general solutions ;-)
>  
>
As for "just type it in", here is a small chunk of rules from Jazillian:
isalnum(x1) --> Character.isLetterOrDigit(x1)
isalpha(x1) --> Character.isLetter(x1)
iscntrl(x1) --> Character.isISOControl(x1)
isdigit(x1) --> Character.isDigit(x1)
isgraph(x1) --> x1 > '\u0020' && x1 < '\u007E' && !Character.isSpaceChar(x1)
islower(x1) --> Character.isLowerCase(x1)
isprint(x1) --> ( x1 > '\u0020' && x1 < '\u007E')
ispunct(x1) --> ( x1 > '\u0020' && x1 < '\u007E' && !Character.isSpaceChar(x1) && !Character.isLetterOrDigit(x1))
isspace(x1) --> Character.isWhitespace(x1)
isupper(x1) --> Character.isUpperCase(x1)
isxdigit(x1) --> Character.isDigit(x1) || ( Character.toLower(x1) >= 'a' && Character.toLower(x1) <= 'f')
memchr(x1, x2) --> CStringUtils.memchr(x1,x2)
tolower(x1) --> Character.toLowerCase(x1)
toupper(x1) --> Character.toUpperCase(x1)

The development time for these is only slightly longer than the time to
type them.
Try implementing them using a treewalker and see how long it takes.
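
For anyone wondering what a rule engine for "name(x1, x2) --> template"
rules might look like, here is a rough Java sketch. It is only an
illustration under simplifying assumptions, not how Jazillian works: the
class CallRewriter and its methods are invented names, it operates on raw
text rather than tokens, and it ignores string and character literals,
comments, and whitespace between a function name and its opening
parenthesis.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: textual rules of the form name(x1, x2) --> template.
class CallRewriter {
    private static final Pattern VAR = Pattern.compile("\\bx(\\d+)\\b");
    // Function name -> replacement template; x1, x2, ... stand for the call's arguments.
    private final Map<String, String> rules = new LinkedHashMap<>();

    CallRewriter rule(String name, String template) {
        rules.put(name, template);
        return this;
    }

    // Rewrites every call to a known function, recursing into the arguments.
    String rewrite(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (!Character.isJavaIdentifierStart(c)) {
                out.append(c);
                i++;
                continue;
            }
            int j = i;
            while (j < src.length() && Character.isJavaIdentifierPart(src.charAt(j))) j++;
            String name = src.substring(i, j);
            String template = rules.get(name);
            boolean isCall = template != null && j < src.length() && src.charAt(j) == '(';
            int close = isCall ? matchingParen(src, j) : -1;
            if (close < 0) {            // not a call to a known function: copy as-is
                out.append(name);
                i = j;
            } else {
                out.append(fill(template, splitArgs(src.substring(j + 1, close))));
                i = close + 1;
            }
        }
        return out.toString();
    }

    // Substitutes x1, x2, ... in a single pass so inserted text is never re-substituted.
    private String fill(String template, List<String> args) {
        Matcher m = VAR.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int n = Integer.parseInt(m.group(1));
            String arg = (n >= 1 && n <= args.size()) ? rewrite(args.get(n - 1).trim()) : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(arg));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Index of the ')' matching the '(' at position open, or -1 if unbalanced.
    private static int matchingParen(String s, int open) {
        int depth = 0;
        for (int k = open; k < s.length(); k++) {
            if (s.charAt(k) == '(') depth++;
            else if (s.charAt(k) == ')' && --depth == 0) return k;
        }
        return -1;
    }

    // Splits an argument list on commas that are not nested inside parentheses.
    private static List<String> splitArgs(String s) {
        List<String> args = new ArrayList<>();
        int depth = 0, start = 0;
        for (int k = 0; k < s.length(); k++) {
            char c = s.charAt(k);
            if (c == '(') depth++;
            else if (c == ')') depth--;
            else if (c == ',' && depth == 0) { args.add(s.substring(start, k)); start = k + 1; }
        }
        args.add(s.substring(start));
        return args;
    }

    public static void main(String[] args) {
        CallRewriter r = new CallRewriter()
                .rule("isalpha", "Character.isLetter(x1)")
                .rule("tolower", "Character.toLowerCase(x1)");
        System.out.println(r.rewrite("if (isalpha(tolower(c))) n++;"));
        // prints: if (Character.isLetter(Character.toLowerCase(c))) n++;
    }
}
```

With something like this in place, each rule in the list above becomes a
single rule(...) call, which is roughly the "type it in" cost being argued
for.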

As for "hard-coded passes", I've found that I have so many of these passes
that I'd guess that less than 1% of my code would be related to
treewalking, if I used treewalking. At that point, you have to wonder
whether it's even worth figuring out how ANTLR treewalking works.

I agree that you have to build something specific to the input and output
language. Sure, Terence has StringTemplate (and even just ANTLR itself
before v3) spitting out code in lots of different languages, but he has an
unfair advantage: he gets to design the ANTLR input language and the scope
of what ANTLR does. Try spitting out PIC clauses in just about any language
other than COBOL...there just is no real equivalent.

Andy

>Jim
>
>  
>