[antlr-interest] Token Stream vs. AST

Thu Oct 12 13:41:48 PDT 2006

On 12. Oct 2006, at 22:03 Uhr, Andy Tripp wrote:

> In all our other cases, the code seemed to always be very generic,  
> high-level, business-logic stuff.
> No unions, no bit twiddling, and very little pointer trickery.  
> Subtle behavior changes are one of those
> things that seem like they'd happen all the time, but actually  
> don't seem to in practice.
>
> In our most recent project, we converted 740,000 lines of C code to  
> Java, and after a few weeks
> of testing we still hadn't seen any subtle runtime bugs. I just  
> IM'ed that customer and he confirmed
> that he can't think of any subtle runtime errors that we  
> introduced. It all depends on the app.

That's my experience, too (albeit with conversion by hand). A half- 
well written program will try to confine the low-level stuff as much  
as possible. Of course, things can get out of hand, but in the  
general case...your work (and its apparent successes) continues to  
impress me!

> A Python person was wondering about converting the python  
> interpreter to Java. Trying to
> automatically convert that was clearly a lost cause.

icky stuff ;) take a look at the perl source code and you know what a  
lost cause is :P

> Yup, nothing stands out like a compilation error! So far, the "just  
> let any pattern that you don't recognize flow through
> from the C to Java" has worked really well in practice. Customers  
> always ask for you to be sure to "flag" stuff that
> you can't handle, and we do (in a log file). But then it seems like  
> the developer who gets the Java code just
> tries to compile it, and doesn't even bother looking at the log  
> file. He sees a memset() or some call to his
> old C-based library, and he knows what's going on.

I second that. Log files are good for debugging if nothing else (my  
experience). Real problems should always be like a smack in the face,  
so they won't be swept under the carpet. My first try would be to  
compile it, too. If nothing else to point me to the problems quickly.

> Just to my Java-trained token-stream-focued eye. It's a really  
> screwed up syntax, this C array declaration:
> char mychar[25];
> ...and it did always bother me, even before Java. It actually  
> "popped out" to my officemate last night.
> He was looking in a book and saying "can you do that?", so it's not  
> just me :)

:) Here's another one:
typedef void (*func)(int);

> Yea, this is what I was thinking at first too. But now I  
> think...it's not that the AST is
> "hiding the ugly token-sequence details from you". What the AST is  
> doing is replacing
> token-sequence-ugliness with AST-uglyness. Yes, you know longer  
> care about "token structure",
> but you do now have to care about "AST structure". Not only that,  
> the "token structure" is usually
> quite clean and intuitive (this typedef example is a rare  
> exception). But the AST structure
> is pretty arbitrary - we've got a C AST and a Java AST, and they're  
> sometimes different for
> identical constructs. Not only that, we actually have two java.g  
> grammars, and they're
> building different AST structures!

Yeah, the structure of an AST is heavily influenced by what you want  
to achieve. Choosing a suitable intermediate representation (IR) is  
crucial. While I probably wouldn't choose the token stream as an IR,  
it surely is an option. But then I find it more natural to look at  
trees than at streams. Your mileage obviously varies...:)

> How am I supposed to easily convert a token stream to an AST in my  
> head, when probably any
> two people who write their own java.g's each come up with their own  
> AST structure? When we
> have "public final static void main(String[] args)", is there a  
> MODIFIERS node in there?
> What's the type of the "[" node again? I can't remember all that -  
> I keep having to look back
> at the grammar.

Wouldn't you yourself be the author of the grammar/AST? I cannot  
imagine that you'd switch grammars like socks during a project, so  
I'd assume that you know the structure pretty much off the top of  
your hat, except maybe in a few border cases which you seldom  
encounter. But for visualization there are tools available (like  
graphviz, or custom made tree viewers) that make checking trees easy.  
I find pen and paper useful, too ;)

> I'd much rather just remember that typedef looks like this:
> typedef ....one or more replacement tokens... token_to_replace;
> ...and then get surprised by the syntax for typedefs involving arrays.
> That's much easier (for me) than remembering what the TYPEDEF part  
> of an AST looks like.

My approach would be to shape the tree to be uniform for all typedefs  
in the parser and then let later stages work on that, instead of  
propagating the special cases further down the stages. That's the  
reason syntax trees are referred to as being abstract. I love  
abstraction, since it lets you talk/take care of things in  
generalized manner, as long as you follow a certain terminology.  
That's why mathematical notation is so incredibly powerful: it let's  
you built new things in terms of existing things. Whenever a notation  
gets messy, you look at it and try to come up with a suitable set of  
abstractions, to maximize your power of expressing things. Same way  
with intermediate representations.
Ok, granted, you have already said that you have found it very hard,  
if not impossible, to find a IR suitable for both Cobol and C to  
Java, so your case might be an exception.
Still, the power of having a translation engine (which suddenly  
sounds a lot like a compiler, doesn't it) that has different front- 
ends, an IR, and a back-end, is something that is very desirable. At  
least to me. It appears you cannot share much code/many rules between  
your C->Java and Cobol->Java products, or am I mistaken?

-k
-- 
Kay Röpke <kroepke at classdump.org>
classdump Software
Key fingerprint = A849 0F2C C322 4022 379E  8661 7E1B FE0D 4CD2 A6D0