[antlr-interest] philosophy about translation
Andy Tripp
antlr at jazillian.com
Wed Oct 11 16:12:11 PDT 2006
Andy Tripp wrote:
> Kay Roepke wrote:
>
>>
>> On 11. Oct 2006, at 22:33 Uhr, Andy Tripp wrote:
>>
>>> Kay Roepke wrote:
>>>
>>>> On 11. Oct 2006, at 20:29 Uhr, Andy Tripp wrote:
>>>>
>>>> Don't you have to know the types of the variables used in the
>>>> source and destination language? Can you actually do without
>>>> a type system and/or symbol table?
>>>
>>>
>>>
>>> Sure, if you can just call a library function that searches a token
>>> stream to figure out a variable's type, then you don't
>>> need to keep a separate table.
>>
>>
>>
>> Ok, then it simply boils down to you not being concerned with
>> redoing that every time you'd need to know a variable's type, right?
>
>
> Right. It's just that my getVariableType() method searches a token
> stream rather than a symbol table, that's all.
>
>> You could of course do the same thing in a tree, traversing it every
>> time you find a reference to a variable. I think this is bothering
>> me the most, I find it hard to grasp why you would like to do the
>> searching over and over, instead of simply recording the fact that
>> `foo' was declared as `int' and then use that information further
>> down the line.
>
>
> As I've said before, the reason is that I'd end up making lots of
> calls to keep the symbol table up to date. By "lots of calls"
> I mean writing lines of code. Assuming that it's just as easy to
> write code (and call it) that looks at the token stream as to look
> at a symbol table, the only reason for a symbol table would be
> performance, which I'm not concerned about.
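To make the token-stream idea concrete, here is a minimal, self-contained sketch of what a getVariableType() that scans backwards for a "<type> name" declaration might look like. Everything here (the class name, the keyword list, the use of plain strings as tokens) is illustrative only, not the actual product's API:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: find a variable's type by scanning the token stream backwards
// for a declaration pattern "<type> <name>" followed by ';' or '='.
// All names here are illustrative, not the real tool's API.
public class TokenTypeLookup {

    // Very rough check: a token "looks like a type" if it is a known keyword.
    static final List<String> TYPE_KEYWORDS =
            Arrays.asList("int", "long", "char", "float", "double", "short");

    // Search backwards from position pos for a declaration of `name`
    // and return its type, or null if no declaration is found.
    static String getVariableType(List<String> tokens, int pos, String name) {
        for (int i = pos - 1; i > 0; i--) {
            if (tokens.get(i).equals(name)
                    && TYPE_KEYWORDS.contains(tokens.get(i - 1))
                    && i + 1 < tokens.size()
                    && (tokens.get(i + 1).equals(";") || tokens.get(i + 1).equals("="))) {
                return tokens.get(i - 1);
            }
        }
        return null; // a real tool would fall back to searching other files
    }

    public static void main(String[] args) {
        // Token stream for: int x = 0 ; x = x + 1 ;
        List<String> tokens =
                Arrays.asList("int", "x", "=", "0", ";", "x", "=", "x", "+", "1", ";");
        System.out.println(getVariableType(tokens, 7, "x")); // the usage of x at index 7
    }
}
```

The point is simply that the lookup is a short loop over tokens, redone per query, rather than a table kept in sync with every transformation.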
>
>> But I agree that from a functionality point of view it doesn't
>> matter how you end up knowing the type of something as long as you
>> can discover it.
>>
>>>> I find it hard to picture doing the right thing without trees,
>>>> but then again I might
>>>> be missing a lot.
>>>
>>>
>>>
>>> As I said, you've got maybe 100,000 lines of code that does all
>>> kinds of stuff to be able to translate sentences like
>>> "Woods Eyes Masters". It knows that "Woods" is a person, it knows
>>> that "Masters" is an upcoming golf tournament.
> >>> It knows that this "Woods" person might actually be "eyeing" this
>>> particular tournament. It knows this is a sports page
>>> headline. It would be able to handle "Andy Masters Woods", too
>>> (assuming it knows that I'm working on
>>> my driving skills).
>>>
>>> By the time you've got all that code working, the fact that you're
>>> using a tree structure is just a very minor footnote.
>>> Your translation above is horribly wrong, and the AST didn't help
>>> you make it right one bit.
>>
>>
>>
>> heh heh. I never said that I actually know any Spanish ;) But I
>> could argue along the same lines: With a flat stream approach, you'd
>> still build up a hierarchical data structure of some sort to
> >> discover the structure of sentences. You would have to be able to
> >> distinguish various grammatical structures (of which I simply don't
> >> know the proper English terms, unfortunately; I'd need to look them
> >> up, which I'm too lazy to do right now...) in order to retain the
> >> logical structure of a sentence. You might not start from an AST,
> >> but you would still implicitly build one.
>
>
>
> Yea, maybe. But it's one thing to use a tree-like data structure as
> one of many data structures during processing.
> It's quite another to architect the whole thing as a "tree-walking
> approach".
>
>>
>>>> When I see a "rule" like
>>>> v1 = v1 + v2 => v1 += v2
>>>> I cannot help but seeing a tree.
>>>
>>>
>>>
>>> That's because it's too trivial of an example, and I think that's
>>> the trap that people keep falling into.
>>> Try:
>>> ADD A B TO C D GIVING E.
>>> NAME PIX 9(5)V99 VALUE ALL SPACES.
>>>
>>> or
>>> main(int argc, char *argv[]);
>>>
> >>> The first ones are hard for you to picture as trees because they're
> >>> COBOL and you don't even know right off
>>> what they MEAN. As for the C example, I have trouble remembering
>>> what the tree is for "*argv[]",
>>> and I'm frustrated that I'll have to deal with "char *argv[]",
>>> "char **argv", "char argv[][]", etc. individually.
>>> I'd rather just once map them all to a single form by doing stuff
>>> like:
>>> char **v --> char *v[]
>>
>>
>>
>> True enough, things can get complicated real quick. But isn't the
>> value of an AST that you can formulate "rules" on the basis of the
>> _abstract_ structure of the input? How would you model something
> >> like this with your rule-based system? E.g.:
>> char **v vs. int **v vs. id *someObject (id being the generic
>> _pointer_ to an object in Objective-C) ?
>
>
> In my pattern matching system, I have simple "placeholders". I'd say:
> v1 **v2 --> v1 v2[][];
> The "v1" and "v2" will match only simple single-ID tokens (like "int")
> and things that look like variable references
> (like foo[3].bar). I also have an "x" placeholder that matches "zero
> or more characters" and handles
> nesting of braces like this:
>
> if (true) {x2} --> x2
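For illustration only, a toy version of the v1/v2 placeholder rewriting can be sketched with a regular expression. The real pattern language tokenizes the input and handles brace nesting, which a bare regex cannot; the class and method names here are made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of placeholder rewriting: "v1 **v2" --> "v1 v2[][]".
// A real engine would tokenize and handle nesting; this regex sketch
// only handles simple identifiers.
public class PlaceholderDemo {
    // v1 and v2 each match a single identifier-like token.
    static final Pattern DOUBLE_STAR = Pattern.compile("(\\w+)\\s*\\*\\*\\s*(\\w+)");

    static String rewrite(String source) {
        Matcher m = DOUBLE_STAR.matcher(source);
        return m.replaceAll("$1 $2[][]");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("char **v;"));   // char v[][];
        System.out.println(rewrite("int **argv;")); // int argv[][];
    }
}
```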
>
>> Would you need one rule for each supported type instead of one rule
>> for all non-pointer types and one exception for pointer types?
>
>
> If things get non-trivial, I can mix this pattern language with code.
> For example, to verify that
> the "v1" in "v1 **v2 --> v1 v2[][];" really is a type, I could say
> something like:
>
> class StarChecker extends PatternRule {
>     StarChecker() {
>         super("v1 ** v2", "v1 v2[][]");
>     }
>     boolean match(Source source) {
>         if (super.match(source)) {
>             String v1 = results.get("v1"); // the text that the "v1" placeholder matched
>             if (source.isType(v1)) {       // verify that it's really a type (e.g. via symbol table)
>                 return true;
>             }
>         }
>         return false;
>     }
> }
>
>>
>>>> In the end, it's just another way to specify a transformation, is
>>>> it not? I mean, what is the
>>>> fundamental difference? Tokens that are close to each other in a
>>>> token stream most often end up close to each other in a tree for
>>>> some metric, aren't they? Ok, they might end up in different
>>>> branches from a common interior node, but for really nasty stuff
>>>> like
>>>> variable decl vs. usage you have symbol tables. I feel like I'm
>>>> missing some important information here.
>>>
>>>
>>>
>>> I think one thing that maybe you're missing is the scope of the
>>> problem that I'm dealing with.
>>> You refer to "variable decl vs. usage" as "really nasty stuff".
>>> Using token streams, I find a declaration
>>> by looking backwards for pattern "<type> v;" or "<type> v =". If I
>>> don't find that pattern, I search other files
>>> for that pattern. That's pretty trivial code to search for that
>>> pattern, starting at a given token. That's not the
>>> "nasty stuff", that's the trivial stuff!
>>
>>
>>
>> Sorry, I wasn't clear. Simply recording the type of a variable is
>> trivial. I was referring to the proximity of those nodes in the tree,
>> which could make it "nasty" in terms of traversal needed to find the
>> declaration when you see a usage.
>
>
> Seems pretty trivial to find the variable declaration either way.
>
>>
>>> The nasty stuff is things like looking at a variable that's an
>>> "int" or "long"
>>> type, looking at every reference to see if it will be possible to
>>> change the variable type to "boolean", and then
>>> doing so. Or looking at a pointer declaration and all its usages to
>>> see what the Java equivalent really is:
>>> is it used solely as an array index? (if so, we'll replace the
>>> pointer by an "int", change the thing it points to
>>> into an array, and change the syntax of every reference) Or is it a
>>> real pointer type, such
>>> as in a LinkedList? Or is it just a pointer
>>> so that we can pass an address to a function and have the function
>>> change the value? This kind of analysis
>>> needs lots of heuristics and lots of work. So I sometimes feel like
>>> I'm just getting started when everyone else
>>> has already given up.
>>>
>>> I think most people here would get to "memset()" and simply say
>>> "sorry, there's no
>>> Java equivalent", whereas I would then get going listing out the
>>> common cases: is memset being
>>> used to initialize a variable which will automatically be
>>> guaranteed to be initialized in Java? Or do I need to
>>> generate some initialization code? Can it just be replaced by a
>>> single "new" call? And so on.
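As a sketch of what one such memset heuristic might look like (the pattern, class, and method names are all hypothetical; a real translator would get the Java type from the surrounding declarations rather than as a parameter):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of one memset heuristic: memset(&s, 0, sizeof(s)) zeroes a struct,
// which in Java usually becomes re-allocating the object: s = new S();
// The Java type name would come from the surrounding declarations in
// practice; here it is passed in explicitly to keep the sketch self-contained.
public class MemsetHeuristic {
    static final Pattern ZERO_FILL = Pattern.compile(
            "memset\\(\\s*&(\\w+)\\s*,\\s*0\\s*,\\s*sizeof\\(\\s*\\1\\s*\\)\\s*\\)\\s*;");

    static String rewrite(String cStatement, String javaType) {
        Matcher m = ZERO_FILL.matcher(cStatement);
        if (m.matches()) {
            return m.group(1) + " = new " + javaType + "();";
        }
        return null; // not the zero-fill idiom; needs a different heuristic
    }

    public static void main(String[] args) {
        System.out.println(rewrite("memset(&point, 0, sizeof(point));", "Point"));
    }
}
```

Other uses of memset (partial fills, nonzero fill bytes) would each need their own case, which is exactly the "listing out the common cases" work described above.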
>>
>>
>>
>> I think this where it gets interesting, too. Actually the lack of
>> tools that produce natural code in the target language is probably
>> the most important reason why many people wouldn't even consider
>> using a language translator for any sizable codebase, which has to
>> be maintained thereafter.
>
>
> Right. I'm trying to change the dynamics on that and get people to
> believe that it can be done.
> I believe my product does it, but it's still a tough sell. About 1/3
> of the people who come up to our booth
> at tradeshows are "skeptics". They come up, take a quick look, and
> then ask "how do you handle unions?" or
> "...memset?" or "...multiple inheritance?" By the time I've started
> explaining about how memset is almost always
> used to initialize a struct to zero, and we check for that sort of
> thing, they walk away. It's how we programmers
> naturally are: we sure want our compilers to work 100%, and it seems
> like a translator should, too. So since
> it's impossible, in theory, we go back and do it by hand (or don't do
> it). It never occurs to us that someone could simply
> automate the stuff that everyone's doing by hand and save everyone 90%
> of the time and effort.
>
> The traveling salesman problem is NP-complete, and yet we have no
> problem using less-than-perfect algorithms that do the best they can,
> because they're so much better than humans. It's a shame we can't seem
> to take the same approach with rewriting code in a new language.
>
> >> Myself included. I have never even considered looking for a Java to
> >> Objective-C translator to port StringTemplate, for instance. Not
> >> that I'd think there is anything like that, but still.
>>
>>> Thanks for the input. I think your response to translating "Woods
>>> Eyes Masters" illustrates my point perfectly.
>>> That is, for real NLP, treewalking solves nothing. It's simply one
>>> way to start approaching the problem,
>>> and by the time you tackle all the hard issues, treewalking is just
>>> a tiny piece of the work, and it's not clear (to me)
>>> that it's even adding anything. To me, saying "high-level language
>>> translation is mostly a matter of walking an AST"
>>> is like saying "Getting from NYC to LA is mostly a matter of
>>> getting on a plane".
>>
>>
>>
>> I'm not an expert in linguistics, far from it, so I can't really say
>> anything for NLP.
>
>
> Yea, me neither. I was pretty shocked at how different the NLP
> approaches are from "compiler" approaches.
> Seemed like zero overlap. I'm still pretty shocked at how bad NLP
> seems to be, but I guess I have just
> one data point: babelfish.
>
> >> But for artificial languages I think it helps solve real problems.
> >> I don't think it is an accident that trees have been in use for these
> >> things for such a long time (on a comp. sci. timescale). Surely, it is
> >> not a silver bullet, as you have demonstrated. If it were the only way
> >> to get things done, you couldn't have built what you have built.
>>
>>> Yea, that may be true. But if
>>> you're going to program a robot to get from NYC to LA, getting on
>>> the plane will be just one tiny piece. To the
>>> point where it doesn't really matter whether the robot takes a
>>> plane or a car.
>>
>>
>>
>> :D
>>
>> cheers,
>> -k
>>
>
>