[antlr-interest] Token Stream vs. AST

Thu Oct 12 08:02:53 PDT 2006

(not replied as per Prashant's request)

On 12. Oct 2006, at 16:12 Uhr, Andy Tripp wrote:
> It's missing the semicolon :)

OK, in my defense, it was about 4 am ;)

>> One thing I cannot agree on is that a translator is allowed to
>> introduce subtle bugs in my code.
>
> This is the mindset I'm trying to change...it's at the heart of  
> what I'm
> doing.
> Every translator that I know of other than Jazillian goes for
> correctness, and every one (that I've seen)
> produces something like 50 lines of code from the one-line "hello,
> world" program.
> I could talk all day about why it's perfectly reasonable to allow bugs
> to be introduced (as is done
> every time a human writes or rewrites anything), but I'll spare you :)

Hmm, to me, this really depends. I fully expect to have to read the
translated code closely. I'm just not sure what kind of bugs you're
talking about here. The really glaring stuff that won't even compile
is not the issue, I think; it's the really subtle, behavior-changing
type of bug that I'm wary of. Please don't spare me, as I'm really
interested in your experiences with this stuff.

> In our case, we'll see if, say, a memset() call matches any  
> patterns of
> usage that we've seen before and have some
> Java equivalent. If not, we'll leave it in the code, give a  
> warning, and
> you'll have a memset() call in the middle of your Java code.
> So, obviously, our translator works much better on "vanilla business
> logic" code than it does on low-level library code.

That sounds like a good approach to me, because the leftover call will
definitely stand out. Most low-level things are specific to one
language anyway, so it's sensible to leave them alone if there's no
good conversion available, IMHO.
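
Just to check that I understand the approach, here is how I picture
it, as a toy Python sketch. Everything in it is my own assumption: the
pattern table, the single rule in it, and the name translate_line are
made up for illustration and have nothing to do with how Jazillian
actually works.

```python
import re

# Hypothetical pattern table: each entry maps one known C usage of
# memset() to a Java replacement. The rule below is invented for
# illustration only.
KNOWN_PATTERNS = [
    # memset(buf, 0, sizeof(buf));  ->  java.util.Arrays.fill(buf, (byte) 0);
    (re.compile(r"memset\(\s*(\w+)\s*,\s*0\s*,\s*sizeof\(\s*\1\s*\)\s*\)\s*;"),
     r"java.util.Arrays.fill(\1, (byte) 0);"),
]

def translate_line(line, warnings):
    """Rewrite a line if a known pattern matches; otherwise keep it and warn."""
    for pattern, replacement in KNOWN_PATTERNS:
        new_line, count = pattern.subn(replacement, line)
        if count:
            return new_line
    if "memset(" in line:
        # No known equivalent: leave the call in place so it stands out.
        warnings.append("no known Java equivalent, left as is: " + line.strip())
    return line
```

A call that matches the table is rewritten; anything else comes out
unchanged, with a warning appended to the list passed in.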

> typedefs are usually of the form:
> typedef THIS IS THE REPLACEMENT    THING_TO_BE_REPLACED;
> this one is of the form:
> typedef PART_OF_REPLACEMENT THING_TO_BE_REPLACED REST_OF_REPLACEMENT.

Well, for array types it always has to be
typedef char MYCHAR[25];
This won't work (and it shouldn't, either):
typedef char[25] MYCHAR;
No surprise there, is there?

> This illustrates the AST vs. token stream mentality really well.  
> The "oh
> no!" moment that I get when
> I see this out-of-whack-token-sequence-meaning is a bad one. But this
> "I'm going to
> now have to think a bit to make sure I understand this" thinking that
> I'm experiencing here is very similar
> to the "I'm going to have to think now about what the AST looks like"
> feeling that I'd have to do ALL THE TIME
> with ASTs.

Hmm. The typedef example is a pretty good indicator that ASTs are a
good way to abstract, isn't it?
It's the parser's job to produce a helpful tree for translations, and
in this case it would look like
(TYPEDEF MYCHAR char[25]), making it easy to replace MYCHAR. A token
stream loses this information and scatters the knowledge of how the
replacement is done across other places, rather than keeping the
knowledge about the source language together in one place.
E.g. for ANTLR, when I want to find out how something works in
syntactic terms, I go looking at the grammars and the trees they
produce.
Then I know for sure what is legal and what is not.
That's what I like about trees. Once done, they give me information
about structure in clear terms. If I look at token streams, that
information is hidden and I have to do the parsing in my head. YMMV.
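
To make the tree argument concrete, here is a small Python sketch of
what I mean. It is not ANTLR and not a real C parser: the functions
tokenize and parse_typedef are hypothetical, and the only thing they
handle is the simple "typedef <base> <name>[N];" shape from the example
above. The point is that the declarator quirk is dealt with once, in
the parser, and everything downstream just sees
(TYPEDEF MYCHAR char[25]).

```python
import re

IDENT = r"[A-Za-z_]\w*"

def tokenize(src):
    # Identifiers, integer literals, and the punctuation used here.
    return re.findall(IDENT + r"|\d+|[\[\];]", src)

def parse_typedef(tokens):
    """Parse 'typedef <base> <name> ([N])* ;' into (TYPEDEF, name, type)."""
    if len(tokens) < 4 or tokens[0] != "typedef" or tokens[-1] != ";":
        raise SyntaxError("not a simple typedef: " + " ".join(tokens))
    base, name, rest = tokens[1], tokens[2], tokens[3:-1]
    if not re.fullmatch(IDENT, base) or not re.fullmatch(IDENT, name):
        raise SyntaxError("expected '<base> <name>': " + " ".join(tokens))
    dims = ""
    while rest:
        # Each array dimension must be exactly '[' digits ']'.
        if len(rest) < 3 or rest[0] != "[" or not rest[1].isdigit() or rest[2] != "]":
            raise SyntaxError("bad array suffix: " + " ".join(tokens))
        dims += "[" + rest[1] + "]"
        rest = rest[3:]
    # The tree node keeps the new name and the full type together.
    return ("TYPEDEF", name, base + dims)
```

With this, the legal form yields ("TYPEDEF", "MYCHAR", "char[25]"),
while "typedef char[25] MYCHAR;" is rejected with a SyntaxError. A
translator working on the raw tokens would have to rediscover that the
"[25]" after MYCHAR belongs to the type everywhere it looks at a
typedef.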

-k