[antlr-interest] Token Stream vs. AST

Thu Oct 12 15:47:52 PDT 2006

Kay Roepke wrote:

>
> On 12. Oct 2006, at 22:03 Uhr, Andy Tripp wrote:
>
>> In all our other cases, the code seemed to always be very generic,  
>> high-level, business-logic stuff.
>> No unions, no bit twiddling, and very little pointer trickery.  
>> Subtle behavior changes are one of those
>> things that seem like they'd happen all the time, but actually  don't 
>> seem to in practice.
>>
>> In our most recent project, we converted 740,000 lines of C code to  
>> Java, and after a few weeks
>> of testing we still hadn't seen any subtle runtime bugs. I just  
>> IM'ed that customer and he confirmed
>> that he can't think of any subtle runtime errors that we  introduced. 
>> It all depends on the app.
>
>
> That's my experience, too (albeit with conversion by hand). A half- 
> well written program will try to confine the low-level stuff as much  
> as possible. Of course, things can get out of hand, but in the  
> general case...your work (and its apparent successes) continues to  
> impress me!

It's still too early to judge Jazillian's commercial success. The 
C-to-Java tool has been available for 18 months,
we're finishing COBOL-to-Java beta now, and starting C++ -to- Java beta. 
But the program does exactly what I had
hoped it would do, the existing customers are very happy. The customer's 
lead technical guy on that large project
was quite amazed. We had converted his 740,000 lines of C to Java, 
including automatically adding hooks into hibernate,
, his new web-based GUI, etc. Took 12 days :) Probably would have taken 
5 people working 6 months to a year
to do by hand.

>
> :) Here's another one:
> typedef void (*func)(int);

Function pointer syntax never looks right unless your eyes are crossed :)

>
> Wouldn't you yourself be the author of the grammar/AST? I cannot  
> imagine that you'd switch grammars like socks during a project, so  
> I'd assume that you know the structure pretty much off the top of  
> your hat, except maybe in a few border cases which you seldom  
> encounter. But for visualization there are tools available (like  
> graphviz, or custom made tree viewers) that make checking trees easy.  
> I find pen and paper useful, too ;)

I've only made slight modifications to Monty's C and GnuC ANTLR 
grammars. I have no problem with their design,
just saying that when I look at a chunk of code, a perfectly correct 
clear picture of the equivalent AST doesn't
pop into my head. I tried doing AST transforms for three months - really 
did give it a decent effort.

>
>> I'd much rather just remember that typedef looks like this:
>> typedef ....one or more replacement tokens... token_to_replace;
>> ...and then get surprised by the syntax for typedefs involving arrays.
>> That's much easier (for me) than remembering what the TYPEDEF part  
>> of an AST looks like.
>
>
> My approach would be to shape the tree to be uniform for all typedefs  
> in the parser and then let later stages work on that, instead of  
> propagating the special cases further down the stages. 

Right, me too.

> That's the  reason syntax trees are referred to as being abstract. I 
> love  abstraction, since it lets you talk/take care of things in  
> generalized manner, as long as you follow a certain terminology.  

I tend to think people think ASTs are more abstract than they really are.
OK, so there's no ";" or "]" stored in the AST - big deal.
I can still look at a C grammar and tell at a glance that it's C and not 
Java.
The grammar/AST is built to exactly match the language, so it's 
completely tied to that language.
The fact that it doesn't quite store every syntactic element doesn't 
mean it's really very abstract.

> That's why mathematical notation is so incredibly powerful: it let's  
> you built new things in terms of existing things. Whenever a notation  
> gets messy, you look at it and try to come up with a suitable set of  
> abstractions, to maximize your power of expressing things. Same way  
> with intermediate representations.

Mathematical notation is powerful that way, but at the expense of 
expressiveness. It's nice to be able to say
matrixC = matrixA * matrixB
...but only because anyone reading that would know what it means to 
multiply two matrices.
If you don't know matrix multiplication, then the "*" shorthand is more 
confusing than the "longhand" that it replaced.
Which raises the question: What's the point of source code? Is it to 
succinctly tell the computer what to do?
or is it to tell the reader what's being done? Well, it's both, so how 
do you balance those two often
conflicting goals?

> Ok, granted, you have already said that you have found it very hard,  
> if not impossible, to find a IR suitable for both Cobol and C to  
> Java, so your case might be an exception.
> Still, the power of having a translation engine (which suddenly  
> sounds a lot like a compiler, doesn't it) that has different front- 
> ends, an IR, and a back-end, is something that is very desirable. At  
> least to me. It appears you cannot share much code/many rules between  
> your C->Java and Cobol->Java products, or am I mistaken?

A good IR would be great, but I don't think it's really possible. Other 
language translators do claim to do this,
and obviously things like CIL
http://en.wikipedia.org/wiki/List_of_hello_world_programs#CIL  works 
fine for compilers.

I'd say about half of my C-to-Java rules are reused in COBOL-to-Java.
All the C-to-Java rules are reused by C++ - to - Java.

>
> -k