[antlr-interest] Token Stream vs. AST

Thu Oct 12 13:03:58 PDT 2006

Kay Roepke wrote:

>
> Hmm, to me, this really depends. I fully expect to have to read the  
> translated code closely. I'm just not sure of what kind of bugs  
> you're talking about here. The really glaring stuff that won't even  
> compile is not the issue I think, it's more the really subtle  
> behavior changing type of bug I'm wary about. Please don't spare me,  
> as I'm really interested in your experiences with this stuff.

One good example was a 5000-line JPEG processing library that we 
converted. 
IIRC, there was just one nasty runtime bug. It was similar to the issue 
of "strcpy(a, b)" not being
the same as "a = b;". I think it was a case where the C code was 
duplicating an array of
objects, and a real-world programmer (and Jazillian) would have just 
replaced it with a
simple Java assignment, and someplace the code really did need two 
distinct copies of the array,
and behavior changed subtly. Fortunately, there was a nice test suite, 
and it was just a matter
of time to track down the root cause.

So the compiler crowd is probably cringing to hear that, but it was 
still a big win.
We'd ported a C (BREW) app to Java. 10 minutes to run it through 
Jazillian, and few hours
hours to debug. Compare that to rewriting in Java by hand, and you've 
saved at least 90%
in time and effort, probably a lot more.

In all our other cases, the code seemed to always be very generic, 
high-level, business-logic stuff.
No unions, no bit twiddling, and very little pointer trickery. Subtle 
behavior changes are one of those
things that seem like they'd happen all the time, but actually don't 
seem to in practice.

In our most recent project, we converted 740,000 lines of C code to 
Java, and after a few weeks
of testing we still hadn't seen any subtle runtime bugs. I just IM'ed 
that customer and he confirmed
that he can't think of any subtle runtime errors that we introduced. It 
all depends on the app.
A Python person was wondering about converting the python interpreter to 
Java. Trying to
automatically convert that was clearly a lost cause.

>
>> In our case, we'll see if, say, a memset() call matches any  patterns of
>> usage that we've seen before and have some
>> Java equivalent. If not, we'll leave it in the code, give a  warning, 
>> and
>> you'll have a memset() call in the middle of your Java code.
>> So, obviously, our translator works much better on "vanilla business
>> logic" code than it does on low-level library code.
>
>
> That sounds like a good approach to me, because it will definitely  
> stand out. Most low-level things are specific to one language anyway,  
> so it's sensible to leave that alone if there's no good conversion  
> available, IMHO.

Yup, nothing stands out like a compilation error! So far, the "just let 
any pattern that you don't recognize flow through
from the C to Java" has worked really well in practice. Customers always 
ask for you to be sure to "flag" stuff that
you can't handle, and we do (in a log file). But then it seems like the 
developer who gets the Java code just
tries to compile it, and doesn't even bother looking at the log file. He 
sees a memset() or some call to his
old C-based library, and he knows what's going on.

>
>> typedefs are usually of the form:
>> typedef THIS IS THE REPLACEMENT    THING_TO_BE_REPLACED;
>> this one is of the form:
>> typedef PART_OF_REPLACEMENT THING_TO_BE_REPLACED REST_OF_REPLACEMENT.
>
>
> Well, for array types it definitely always is
> typedef char MYCHAR[25];
> This won't work (and it shouldn't, too):
> typedef char[25] MYCHAR;
> There's no surprise, is there?

Just to my Java-trained token-stream-focued eye. It's a really screwed 
up syntax, this C array declaration:
char mychar[25];
...and it did always bother me, even before Java. It actually "popped 
out" to my officemate last night.
He was looking in a book and saying "can you do that?", so it's not just 
me :)

>
>
>> This illustrates the AST vs. token stream mentality really well.  The 
>> "oh
>> no!" moment that I get when
>> I see this out-of-wack-token-sequence-meaning is a bad one. But  this
>> "I'm going to
>> now have to think a bit to make sure I understand this" thinking that
>> I'm experiencing here is very similar
>> to the "I'm going to have to think now about what the AST looks like"
>> feeling that I'd have to do ALL THE TIME
>> with ASTs.
>
>
> mmh. The typedef example is a pretty good indicator that ASTs are a  
> good way to abstract, isn't it?
> It's the parsers job to produce a helpful tree for translations and  
> this would look like
> (TYPEDEF MYCHAR char[25]) in this case, making it easy to replace  
> MYCHAR. The tokenstream would lose this information and would scatter  
> the knowledge of how this is to be done in other places, rather than  
> to keep the knowledge about the source language together in one place.
> E.g. for ANTLR, when I want to find out how something works in  
> syntactic terms, I go looking at the grammars and the trees they  
> procude.
> Then I know for sure what is legal and what is not.
> That's what I like about trees. Once done, they give me information  
> about structure in clear terms. If I look at token streams, that  
> information
> is hidden and I have to do the parsing in my head. YMMV.

Yea, this is what I was thinking at first too. But now I think...it's 
not that the AST is
"hiding the ugly token-sequence details from you". What the AST is doing 
is replacing
token-sequence-ugliness with AST-uglyness. Yes, you know longer care 
about "token structure",
but you do now have to care about "AST structure". Not only that, the 
"token structure" is usually
quite clean and intuitive (this typedef example is a rare exception). 
But the AST structure
is pretty arbitrary - we've got a C AST and a Java AST, and they're 
sometimes different for
identical constructs. Not only that, we actually have two java.g 
grammars, and they're
building different AST structures!

How am I supposed to easily convert a token stream to an AST in my head, 
when probably any
two people who write their own java.g's each come up with their own AST 
structure? When we
have "public final static void main(String[] args)", is there a 
MODIFIERS node in there?
What's the type of the "[" node again? I can't remember all that - I 
keep having to look back
at the grammar.

I'd much rather just remember that typedef looks like this:
typedef ....one or more replacement tokens... token_to_replace;
...and then get surprised by the syntax for typedefs involving arrays.
That's much easier (for me) than remembering what the TYPEDEF part of an 
AST looks like.

Andy.