[antlr-interest] philosophy about translation

Wed Oct 11 14:56:27 PDT 2006

On 11. Oct 2006, at 22:33 Uhr, Andy Tripp wrote:

> Kay Roepke wrote:
>
>> On 11. Oct 2006, at 20:29 Uhr, Andy Tripp wrote:
>>
>> Don't you have to know the types of the variables used in the  
>> source  and destination language? Can you actually do without
>> a type system and/or symbol table?
>
> Sure, if you can just call a library function that searches a token  
> stream to figure out a variable's type, then you don't
> need to keep a separate table.

Ok, then it simply boils down to you not being concerned with redoing  
that every time you'd need to know a variable's type, right?
You could of course do the same thing in a tree, traversing it every  
time you find a reference to a variable. I think this is bothering me  
the most, I find it hard to grasp why you would like to do the  
searching over and over, instead of simply recording the fact that  
`foo' was declared as `int' and then use that information further  
down the line. But I agree that from a functionality point of view it  
doesn't matter how you end up knowing the type of something as long  
as you can discover it.

>> I find it hard to picture doing  the right thing without trees,  
>> but then again I might
>> be missing a lot.
>
> As I said, you've got maybe 100,000 lines of code that does all  
> kinds of stuff to be able to translate sentences like
> "Woods Eyes Masters". It knows that "Woods" is a person, it knows  
> that "Masters" is an upcoming golf tournament.
> I knows that this "Woods" person might actually be "eyeing" this  
> particular tournament. It knows this is a sports page
> headline. It would be able to handle "Andy Masters Woods", too  
> (assuming it knows that I'm working on
> my driving skills).
>
> By the time you've got all that code working, the fact that you're  
> using a tree structure is just a very minor footnote.
> Your translation above is horribly wrong, and the AST didn't help  
> you make it right one bit.

heh heh. I never said that I actually know any Spanish ;) But I could  
argue along the same lines: With a flat stream approach, you'd still  
build up a hierarchical data structure of some sort to discover the  
structure of sentences. You would have to able to distinguish various  
grammatical structures (of which I simply don't know the proper  
english terms, unfortunately. I'd need to look them up, which I'm too  
lazy right now...), in order to retain the logical structure of a  
sentence. You might not start from an AST, but you still would  
implicitely build one.

>> When I see a "rule" like
>>     v1 = v1 + v2 => v1 += v2
>> I cannot help but seeing a tree.
>
> That's because it's too trivial of an example, and I think that's  
> the trap that people keep falling into.
> Try:
> ADD A B TO C D GIVING E.
> NAME PIX 9(5)V99 VALUE ALL SPACES.
>
> or
> main(int argc, char *argv[]);
>
> The first ones is hard for you picture as a tree because it's COBOL  
> and you don't even know right off
> what they MEAN. As for the C example, I have trouble remembering  
> what the tree is for "*argv[]",
> and I'm frustrated that I'll have to deal with "char *argv[]",  
> "char **argv", "char argv[][]", etc. individually.
> I'd rather just once map them all to a single form by doing stuff  
> like:
> char **v --> char *v[]

True enough, things can get complicated real quick. But isn't the  
value of an AST that you can formulate "rules" on the basis of the  
_abstract_ structure of the input? How would you model something like  
this with your rule based system? E.g.:
char **v vs. int **v vs. id *someObject (id being the generic  
_pointer_ to an object in Objective-C) ?
Would you need one rule for each supported type instead of one rule  
for all non-pointer types and one exception for pointer types?

>> In the end, it's just another way to  specify a transformation, is  
>> it not? I mean, what is the
>> fundamental difference? Tokens that are close to each other in a   
>> token stream most often end up close to each other in a tree for
>> some metric, aren't they? Ok, they might end up in different  
>> branches  from a common interior node, but for really nasty stuff  
>> like
>> variable decl vs. usage you have symbol tables. I feel like I'm   
>> missing some important information here.
>
> I think one thing that maybe you're missing is the scope of the  
> problem that I'm dealing with.
> You refer to "variable decl vs. usage" as "really nasty stuff".  
> Using token streams, I find a declaration
> by looking backwards for pattern "<type> v;" or "<type> v =". If I  
> don't find that pattern, I search other files
> for that pattern. That's pretty trivial code to search for that  
> pattern, starting at a given token. That's not the
> "nasty stuff", that's the trivial stuff!

Sorry, I wasn't clear. Simply recording the type of a variable is  
trivial. I was referring to the proximity of those nodes in the tree,
which could make it "nasty" in terms of traversal needed to find the  
declaration when you see a usage.

> The nasty stuff is things like looking at a variable that's an  
> "int" or "long"
> type, looking at every reference to see if it will be possible to  
> change the variable type to "boolean", and then
> doing so. Or looking at a pointer declaration and all its usages to  
> see what the Java equivalent really is:
> is it used solely as an array index? (if so, we'll replace the  
> pointer by an "int", change the thing it points to
> into an array, and change the syntax of every reference) Or is it a  
> real pointer type, such
> as in a LinkedList? Or is it just a pointer
> so that we can pass an address to a function and have the function  
> change the value? This kind of analysis
> needs lots of heuristics and lots of work. So I sometimes feel like  
> I'm just getting started when everyone else
> has already given up.
>
> I think most people here would get to "memset()" and simply say  
> "sorry, there's no
> Java equivalent", whereas I would then get going listing out the  
> common cases: is memset being
> used to initialize a variable which will automatically be  
> guaranteed to be initialized in Java? Or do I need to
> generate some initialization code? Can it just be replaced by a  
> single "new" call? And so on.

I think this where it gets interesting, too. Actually the lack of  
tools that produce natural code in the target language is probably  
the most important reason why many people wouldn't even consider  
using a language translator for any sizable codebase, which has to be  
maintained thereafter.
Myself included. I have never considered even to look for a Java to  
Objective-C translator to port StringTemplate, for instance. Not that  
I'd think there is anything like that, but still.

> Thanks for the input. I think your response to translating "Woods  
> Eyes Masters" illustrates my point perfectly.
> That is, for real NLP, treewalking solves nothing. It's simply one  
> way to start approaching the problem,
> and by the time you tackle all the hard issues, treewalking is just  
> a tiny piece of the work, and it's not clear (to me)
> that it's even adding anything. To me, saying "high-level language  
> translation is mostly a matter of walking an AST"
> is like saying "Getting from NYC to LA is mostly a matter of  
> getting on a plane".

I'm not an expert in linguistics, far from it, so I can't really say  
anything for NLP. But for artificial languages I think it helps  
solving real problems. I don't think it is an accident that trees are  
in use for these things for such a long time (scale: comp.sci.).  
Surely, it is a silver bullet, as you have demonstrated. If it was  
the only way to get things done, you couldn't have built what you  
have built.

> Yea, that may be true. But if
> you're going to program a robot to get from NYC to LA, getting on  
> the plane will be just one tiny piece. To the
> point where it doesn't really matter whether the robot takes a  
> plane or a car.

:D

cheers,
-k