[antlr-interest] philosophy about translation

Andy Tripp antlr at jazillian.com
Wed Oct 11 16:12:11 PDT 2006


Andy Tripp wrote:

> Kay Roepke wrote:
>
>>
>> On 11. Oct 2006, at 22:33 Uhr, Andy Tripp wrote:
>>
>>> Kay Roepke wrote:
>>>
>>>> On 11. Oct 2006, at 20:29 Uhr, Andy Tripp wrote:
>>>>
>>>> Don't you have to know the types of the variables used in the  
>>>> source  and destination language? Can you actually do without
>>>> a type system and/or symbol table?
>>>
>>>
>>>
>>> Sure, if you can just call a library function that searches a token  
>>> stream to figure out a variable's type, then you don't
>>> need to keep a separate table.
>>
>>
>>
>> Ok, then it simply boils down to you not being concerned with 
>> redoing  that every time you'd need to know a variable's type, right?
>
>
> Right. It's just that my getVariableType() method searches a token 
> stream rather than a symbol table, that's all.
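[A minimal sketch of what such a token-stream lookup might look like. All names here are illustrative guesses, not Jazillian's actual API:]

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: recover a variable's type by scanning the token
// stream backwards for a "<type> name" declaration pattern, instead of
// maintaining a separate symbol table.
class TokenStreamTypes {
    static final Set<String> TYPES =
            Set.of("int", "long", "char", "float", "double", "boolean");

    // refIndex points at a token that references the variable.
    static String getVariableType(List<String> tokens, int refIndex) {
        String name = tokens.get(refIndex);
        for (int i = refIndex - 1; i >= 1; i--) {
            // a declaration looks like: <type> <name>
            if (tokens.get(i).equals(name)
                    && TYPES.contains(tokens.get(i - 1))) {
                return tokens.get(i - 1);
            }
        }
        return null; // no declaration found in this stream
    }
}
```

[For the token stream `int x = 3 ; x ++`, a lookup at the second `x` walks back to the `int x` pair and returns "int".]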
>
>> You could of course do the same thing in a tree, traversing it every
>> time you find a reference to a variable. I think this is what
>> bothers me the most: I find it hard to grasp why you would want to
>> do the searching over and over, instead of simply recording the fact
>> that `foo' was declared as `int' and then using that information
>> further down the line.
>
>
> As I've said before, the reason is that I'd end up making lots of
> calls to keep the symbol table up to date. By "lots of calls" I mean
> lots of lines of code to write. Assuming that it's just as easy to
> write (and call) code that looks at the token stream as code that
> looks at a symbol table, the only reason for a symbol table would be
> performance, which I'm not concerned about.
>
>> But I agree that from a functionality point of view it  doesn't 
>> matter how you end up knowing the type of something as long  as you 
>> can discover it.
>>
>>>> I find it hard to picture doing  the right thing without trees,  
>>>> but then again I might
>>>> be missing a lot.
>>>
>>>
>>>
>>> As I said, you've got maybe 100,000 lines of code that does all  
>>> kinds of stuff to be able to translate sentences like
>>> "Woods Eyes Masters". It knows that "Woods" is a person, it knows  
>>> that "Masters" is an upcoming golf tournament.
>>> It knows that this "Woods" person might actually be "eyeing" this
>>> particular tournament. It knows this is a sports page
>>> headline. It would be able to handle "Andy Masters Woods", too  
>>> (assuming it knows that I'm working on
>>> my driving skills).
>>>
>>> By the time you've got all that code working, the fact that you're  
>>> using a tree structure is just a very minor footnote.
>>> Your translation above is horribly wrong, and the AST didn't help  
>>> you make it right one bit.
>>
>>
>>
>> heh heh. I never said that I actually know any Spanish ;) But I
>> could argue along the same lines: with a flat stream approach, you'd
>> still build up a hierarchical data structure of some sort to
>> discover the structure of sentences. You would have to be able to
>> distinguish various grammatical structures (of which I simply don't
>> know the proper English terms, unfortunately; I'd need to look them
>> up, which I'm too lazy to do right now...) in order to retain the
>> logical structure of a sentence. You might not start from an AST,
>> but you still would implicitly build one.
>
>
>
> Yea, maybe. But it's one thing to use a tree-like data structure as 
> one of many data structures during processing.
> It's quite another to architect the whole thing as a "tree-walking 
> approach".
>
>>
>>>> When I see a "rule" like
>>>>     v1 = v1 + v2 => v1 += v2
>>>> I cannot help but seeing a tree.
>>>
>>>
>>>
>>> That's because it's too trivial of an example, and I think that's  
>>> the trap that people keep falling into.
>>> Try:
>>> ADD A B TO C D GIVING E.
>>> NAME PIX 9(5)V99 VALUE ALL SPACES.
>>>
>>> or
>>> main(int argc, char *argv[]);
>>>
>>> The first ones are hard for you to picture as trees because they're
>>> COBOL and you don't even know right off what they MEAN. As for the
>>> C example, I have trouble remembering what the tree is for
>>> "*argv[]", and I'm frustrated that I'll have to deal with
>>> "char *argv[]", "char **argv", "char argv[][]", etc. individually.
>>> I'd rather just once map them all to a single form by doing stuff  
>>> like:
>>> char **v --> char *v[]
>>
>>
>>
>> True enough, things can get complicated real quick. But isn't the  
>> value of an AST that you can formulate "rules" on the basis of the  
>> _abstract_ structure of the input? How would you model something
>> like this with your rule-based system? E.g.:
>> char **v vs. int **v vs. id *someObject (id being the generic  
>> _pointer_ to an object in Objective-C) ?
>
>
> In my pattern matching system, I have simple "placeholders". I'd say:
> v1 **v2 --> v1 v2[][];
> The "v1" and "v2" will match only simple single-ID tokens (like "int") 
> and things that look like variable references
> (like foo[3].bar). I also have an "x" placeholder that matches "zero 
> or more characters" and handles
> nesting of braces like this:
>
> if (true) {x2} --> x2
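[The placeholder idea can be sketched as follows. This is hypothetical code, not the actual Jazillian engine; "v" placeholders bind a single identifier-like token, other pattern tokens must match literally, and the "x" placeholders and brace nesting are omitted:]

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of placeholder matching: "v1", "v2", ... each match
// one identifier-like token; any other pattern token must match the
// input token exactly.
class PlaceholderMatcher {
    // Returns the placeholder bindings, or null if the match fails.
    static Map<String, String> match(String[] pattern, String[] tokens) {
        if (pattern.length != tokens.length) return null;
        Map<String, String> bindings = new HashMap<>();
        for (int i = 0; i < pattern.length; i++) {
            String p = pattern[i];
            if (p.matches("v\\d+")) {
                // placeholder: must bind to an identifier-like token
                if (!tokens[i].matches("[A-Za-z_][A-Za-z0-9_]*")) {
                    return null;
                }
                bindings.put(p, tokens[i]);
            } else if (!p.equals(tokens[i])) {
                return null; // literal tokens must match exactly
            }
        }
        return bindings;
    }
}
```

[Matching the pattern `v1 * * v2` against the tokens `int * * argv` binds v1=int and v2=argv; matching `v1` against `3` fails, since `3` is not an identifier.]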
>
>> Would you need one rule for each supported type instead of one rule  
>> for all non-pointer types and one exception for pointer types?
>
>
> If things get non-trivial, I can mix this pattern language with code.
> For example, to verify that the "v1" in "v1 **v2 --> v1 v2[][]"
> really is a type, I could say something like:
>
> class StarChecker extends PatternRule {
>     StarChecker() {
>         super("v1 ** v2", "v1 v2[][]");
>     }
>     boolean match(Source source) {
>         if (super.match(source)) {
>             String v1 = results.get("v1"); // text the "v1" placeholder matched
>             if (source.isType(v1)) {       // verify that it's really a type
>                 return true;               // (e.g. using a symbol table)
>             }
>         }
>         return false;
>     }
> }
>
>>
>>>> In the end, it's just another way to  specify a transformation, is  
>>>> it not? I mean, what is the
>> fundamental difference? Tokens that are close to each other in a
>> token stream most often end up close to each other in a tree, by
>> some metric, don't they? Ok, they might end up in different
>>>> branches  from a common interior node, but for really nasty stuff  
>>>> like
>>>> variable decl vs. usage you have symbol tables. I feel like I'm   
>>>> missing some important information here.
>>>
>>>
>>>
>>> I think one thing that maybe you're missing is the scope of the  
>>> problem that I'm dealing with.
>>> You refer to "variable decl vs. usage" as "really nasty stuff".  
>>> Using token streams, I find a declaration
>>> by looking backwards for pattern "<type> v;" or "<type> v =". If I  
>>> don't find that pattern, I search other files
>>> for that pattern. It's pretty trivial to write code that searches
>>> for that pattern, starting at a given token. That's not the
>>> "nasty stuff", that's the trivial stuff!
>>
>>
>>
>> Sorry, I wasn't clear. Simply recording the type of a variable is  
>> trivial. I was referring to the proximity of those nodes in the tree,
>> which could make it "nasty" in terms of traversal needed to find the  
>> declaration when you see a usage.
>
>
> Seems pretty trivial to find the variable declaration either way.
>
>>
>>> The nasty stuff is things like looking at a variable that's an  
>>> "int" or "long"
>>> type, looking at every reference to see if it will be possible to  
>>> change the variable type to "boolean", and then
>>> doing so. Or looking at a pointer declaration and all its usages to  
>>> see what the Java equivalent really is:
>>> is it used solely as an array index? (if so, we'll replace the  
>>> pointer by an "int", change the thing it points to
>>> into an array, and change the syntax of every reference) Or is it a  
>>> real pointer type, such
>>> as in a LinkedList? Or is it just a pointer
>>> so that we can pass an address to a function and have the function  
>>> change the value? This kind of analysis
>>> needs lots of heuristics and lots of work. So I sometimes feel like  
>>> I'm just getting started when everyone else
>>> has already given up.
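[The first case above, a pointer used only as an array index, can be illustrated with a hand-translated sketch. This is purely hypothetical example code, not the tool's actual output:]

```java
// C original:
//     int *p = arr;
//     while (*p != 0) p++;
//     return p - arr;
//
// Java translation: the pointer becomes an int index, and each "*p"
// dereference is rewritten as an indexing expression "arr[p]".
class PointerAsIndex {
    static int scanToZero(int[] arr) {
        int p = 0;                // the former pointer, now an index
        while (arr[p] != 0) p++;
        return p;                 // same value as "p - arr" in the C code
    }
}
```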
>>>
>>> I think most people here would get to "memset()" and simply say  
>>> "sorry, there's no
>>> Java equivalent", whereas I would then get going listing out the  
>>> common cases: is memset being
>>> used to initialize a variable which will automatically be  
>>> guaranteed to be initialized in Java? Or do I need to
>>> generate some initialization code? Can it just be replaced by a  
>>> single "new" call? And so on.
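[The common zero-initialization case can be shown with a hand-translated sketch (hypothetical example, not the tool's actual output): since Java guarantees that instance fields start out zeroed, the memset call disappears entirely.]

```java
// C original:
//     struct Point { int x; int y; };
//     struct Point pt;
//     memset(&pt, 0, sizeof(pt));
//
// Java translation: "new" already zeroes the fields, so no
// initialization code needs to be generated at all.
class Point {
    int x;
    int y;
}
```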
>>
>>
>>
>> I think this is where it gets interesting, too. Actually the lack of
>> tools that produce natural code in the target language is probably  
>> the most important reason why many people wouldn't even consider  
>> using a language translator for any sizable codebase, which has to 
>> be  maintained thereafter.
>
>
> Right. I'm trying to change the dynamics on that and get people to 
> believe that it can be done.
> I believe my product does it, but it's still a tough sell. About 1/3 
> of the people who come up to our booth
> at tradeshows are "skeptics". They come up, take a quick look, and 
> then ask "how do you handle unions?" or
> "...memset?" or "multiple inheritance?" By the time I've started
> explaining about how memset is almost always
> used to initialize a struct to zero, and we check for that sort of 
> thing, they walk away. It's how we programmers
> naturally are: we sure want our compilers to work 100%, and it seems 
> like a translator should, too. So since
> it's impossible, in theory, we go back and do it by hand (or don't do 
> it). It never occurs to us that someone could simply
> automate the stuff that everyone's doing by hand and save everyone 90% 
> of the time and effort.
>
> The traveling salesman problem is NP-complete, and yet we have no
> problem letting algorithms that are less than perfect do the best
> they can, because they're so much better than humans. It's a shame we
> can't seem to take the same approach with rewriting code in a new
> language.
>
>> Myself included. I have never even considered looking for a Java to
>> Objective-C translator to port StringTemplate, for instance. Not
>> that I'd think there is anything like that, but still.
>>
>>> Thanks for the input. I think your response to translating "Woods  
>>> Eyes Masters" illustrates my point perfectly.
>>> That is, for real NLP, treewalking solves nothing. It's simply one  
>>> way to start approaching the problem,
>>> and by the time you tackle all the hard issues, treewalking is just  
>>> a tiny piece of the work, and it's not clear (to me)
>>> that it's even adding anything. To me, saying "high-level language  
>>> translation is mostly a matter of walking an AST"
>>> is like saying "Getting from NYC to LA is mostly a matter of  
>>> getting on a plane".
>>
>>
>>
>> I'm not an expert in linguistics, far from it, so I can't really say  
>> anything for NLP. 
>
>
> Yea, me neither. I was pretty shocked at how different the NLP 
> approaches are from "compiler" approaches.
> Seemed like zero overlap. I'm still pretty shocked at how bad NLP 
> seems to be, but I guess I have just
> one data point: babelfish.
>
>> But for artificial languages I think it helps solve real problems.
>> I don't think it is an accident that trees have been in use for
>> these things for such a long time (on a comp.sci. scale). Surely, it
>> is not a silver bullet, as you have demonstrated: if it were the
>> only way to get things done, you couldn't have built what you have
>> built.
>>
>>> Yea, that may be true. But if
>>> you're going to program a robot to get from NYC to LA, getting on  
>>> the plane will be just one tiny piece. To the
>>> point where it doesn't really matter whether the robot takes a  
>>> plane or a car.
>>
>>
>>
>> :D
>>
>> cheers,
>> -k
>>
>
>



More information about the antlr-interest mailing list