[antlr-interest] philosophy about translation

Andy Tripp antlr at jazillian.com
Wed Oct 11 13:33:03 PDT 2006


Kay Roepke wrote:

>
> On 11. Oct 2006, at 20:29, Andy Tripp wrote:
>
>>  Go ahead and put
>> "Woods Eyes Masters" into a tree and then convert to Spanish.  You'll 
>> come back later
>> and say "...but my program would have to know the context to even  
>> see that it's talking
>> about Tiger!" and I'll grin and say "that's right."
>
>
> sorry, but I cannot refrain:
>
>     Lexer         ->        Parser     ->        AST
> "Woods Eyes Masters" -> NOUN VERB NOUN -> (VERB["Eyes"] SUBJECT["Woods"] OBJECT["Masters"])
>
>     Lexer         ->     Token stream  ->    Translation
> "Woods Eyes Masters" -> NOUN VERB NOUN -> "Las maderas echan miradas a los amos"
>
> Even if you do not consider my horrible Spanish, I fail to see how
> any approach could do this without knowing the context.

Right.

>
>
> I agree with you that the mechanics of how you organize your  
> translation become minor the farther you are into the project.
> You will eventually have to build up a lot of support code to do the  
> job (take a look at compilers, most often the actual
> lexing/parsing is the least part in generating the output. Much more  
> work has to be put into the type system, building up
> graphs to do optimizations, semantic checks, checking for invalid  
> operations etc.)

Right. So now you've got 100,000 lines of support code that were 
triggered by a single AST action:
"Here is a sentence, translate it by looking at the whole context".
At that point, is it even a "tree-walking architecture"?

>
> Don't you have to know the types of the variables used in the source  
> and destination language? Can you actually do without
> a type system and/or symbol table? 

Sure, if you can just call a library function that searches a token 
stream to figure out a variable's type, then you don't
need to keep a separate table.

> I find it hard to picture doing  the right thing without trees, but 
> then again I might
> be missing a lot.

As I said, you've got maybe 100,000 lines of code that does all kinds of 
stuff to be able to translate sentences like
"Woods Eyes Masters". It knows that "Woods" is a person, it knows that 
"Masters" is an upcoming golf tournament.
It knows that this "Woods" person might actually be "eyeing" this 
particular tournament. It knows this is a sports page
headline. It would be able to handle "Andy Masters Woods", too (assuming 
it knows that I'm working on
my driving skills).

By the time you've got all that code working, the fact that you're using 
a tree structure is just a very minor footnote.
Your translation above is horribly wrong, and the AST didn't help you 
make it right one bit.

> I would expect to have the structure of the input (tree vs. flat  
> stream) to not have much influence on your ability to produce
> "natural" code. Both approaches force you to look all over the place  
> to determine the usage of, say the malloc() family, e.g.
> is it used to realloc an array, to buffer up some characters etc.  
> These would obviously be coded quite differently in Java.
>
> When I see a "rule" like
>     v1 = v1 + v2 => v1 += v2
> I cannot help but seeing a tree. 

That's because it's too trivial of an example, and I think that's the 
trap that people keep falling into.
Try:
ADD A B TO C D GIVING E.
NAME PIX 9(5)V99 VALUE ALL SPACES.

or
main(int argc, char *argv[]);

The first two are hard to picture as trees because they're COBOL and 
you don't even know right off
what they MEAN. As for the C example, I have trouble remembering what 
the tree is for "*argv[]",
and I'm frustrated that I'll have to deal with "char *argv[]", "char 
**argv", "char argv[][]", etc. individually.
I'd rather just once map them all to a single form by doing stuff like:
char **v --> char *v[]
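Here's a minimal sketch of that kind of normalization pass, working on the 
declaration as flat text. The class name, the regex-based matching, and the 
single rewrite rule are all my own illustration, not Jazillian's actual code:

```java
public class Normalize {
    // Hypothetical sketch: rewrite "T **v" into the single canonical
    // form "T *v[]", so every later rule only has to match one shape.
    static String canonical(String decl) {
        java.util.regex.Matcher m = java.util.regex.Pattern
            .compile("(\\w+)\\s*\\*\\*\\s*(\\w+)").matcher(decl);
        if (m.matches()) {
            return m.group(1) + " *" + m.group(2) + "[]";
        }
        return decl;  // already canonical, or some other shape
    }

    public static void main(String[] args) {
        System.out.println(canonical("char **argv"));  // char *argv[]
    }
}
```

The point is that the rewrite is a local, flat substitution; no tree needs 
to be built or walked to apply it.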

> In the end, it's just another way to  specify a transformation, is it 
> not? I mean, what is the
> fundamental difference? Tokens that are close to each other in a  
> token stream most often end up close to each other in a tree for
> some metric, aren't they? Ok, they might end up in different branches  
> from a common interior node, but for really nasty stuff like
> variable decl vs. usage you have symbol tables. I feel like I'm  
> missing some important information here.

I think one thing that maybe you're missing is the scope of the problem 
that I'm dealing with.
You refer to "variable decl vs. usage" as "really nasty stuff". Using 
token streams, I find a declaration
by looking backwards for pattern "<type> v;" or "<type> v =". If I don't 
find that pattern, I search other files
for that pattern. The code that searches for that pattern, starting at 
a given token, is pretty trivial. That's not the
"nasty stuff"; that's the trivial stuff!
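To make "that's the trivial stuff" concrete, here's a rough sketch of that 
backwards search over a flat token stream. The class name, the tiny type 
list, and the token representation as plain strings are all hypothetical 
simplifications:

```java
import java.util.Arrays;
import java.util.List;

public class FindDecl {
    // Hypothetical sketch: starting from a usage of variable v, scan
    // backwards through the token stream for "<type> v ;" or "<type> v =".
    static final List<String> TYPES =
        Arrays.asList("int", "long", "char", "float", "double");

    static String declaredType(List<String> tokens, int usageIndex, String v) {
        for (int i = usageIndex - 1; i >= 1; i--) {
            String next = (i + 1 < tokens.size()) ? tokens.get(i + 1) : "";
            if (tokens.get(i).equals(v)
                    && TYPES.contains(tokens.get(i - 1))
                    && (next.equals(";") || next.equals("="))) {
                return tokens.get(i - 1);  // found "<type> v ;" or "<type> v ="
            }
        }
        return null;  // not in this stream; a real tool would search other files
    }

    public static void main(String[] args) {
        // tokens for: long n = 0; n++;
        List<String> toks = Arrays.asList("long", "n", "=", "0", ";", "n", "++", ";");
        System.out.println(declaredType(toks, 5, "n"));  // long
    }
}
```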

The nasty stuff is things like looking at a variable that's an "int" or 
"long"
type, looking at every reference to see if it will be possible to change 
the variable type to "boolean", and then
doing so. Or looking at a pointer declaration and all its usages to see 
what the Java equivalent really is:
is it used solely as an array index? (If so, we'll replace the pointer 
with an "int", change the thing it points to
into an array, and change the syntax of every reference.) Or is it a real 
pointer type, such
as in a LinkedList? Or is it just a pointer
so that we can pass an address to a function and have the function 
change the value? This kind of analysis
needs lots of heuristics and lots of work. So I sometimes feel like I'm 
just getting started when everyone else
has already given up.
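A skeleton of that kind of usage classification might look like the 
following. The enum, the string-matching heuristics, and the usage 
representation are all invented for illustration; the real analysis would 
need far more cases and far better evidence:

```java
import java.util.List;

public class PointerKind {
    enum Kind { ARRAY_INDEX, OUT_PARAM, REAL_POINTER }

    // Hypothetical sketch of the heuristic described above: look at every
    // usage of a pointer variable p and guess what it is "really" doing,
    // which decides whether the Java translation is an int index, an
    // output parameter, or a genuine object reference.
    static Kind classify(List<String> usages) {
        boolean onlyIndexing = usages.stream()
            .allMatch(u -> u.contains("++") || u.contains("+ 1") || u.contains("["));
        if (onlyIndexing) return Kind.ARRAY_INDEX;   // becomes an int index
        boolean assignedThrough = usages.stream()
            .anyMatch(u -> u.startsWith("*p ="));
        if (assignedThrough) return Kind.OUT_PARAM;  // callee writes through it
        return Kind.REAL_POINTER;                    // e.g. a linked-list node
    }

    public static void main(String[] args) {
        System.out.println(classify(List.of("p++", "p[0]")));     // ARRAY_INDEX
        System.out.println(classify(List.of("*p = x", "f(p)"))); // OUT_PARAM
    }
}
```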

I think most people here would get to "memset()" and simply say "sorry, 
there's no
Java equivalent", whereas I would then start listing the common 
cases: is memset() being
used to initialize a variable which will automatically be guaranteed to 
be initialized in Java? Or do I need to
generate some initialization code? Can it just be replaced by a single 
"new" call? And so on.
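As a sketch of what "listing the common cases" could mean in code, the 
rule below takes the pieces of a memset() call and picks a common Java 
rewrite, falling through to "needs hand inspection". The cases and the 
string-based matching are illustrative only:

```java
public class MemsetRule {
    // Hypothetical sketch: given the arguments of memset(dest, value, size),
    // pick a common Java rewrite, or return null if no common case matches.
    static String translate(String dest, String value, String size) {
        if (value.equals("0") && size.equals("sizeof(" + dest + ")")) {
            // Zeroing a whole object: Java fields are zero-initialized,
            // so a fresh allocation often suffices.
            return dest + " = new /* type of " + dest + " */();";
        }
        if (value.equals("0")) {
            // Zeroing part of a byte buffer maps onto Arrays.fill().
            return "java.util.Arrays.fill(" + dest + ", 0, " + size + ", (byte) 0);";
        }
        return null;  // no common case matched; needs hand inspection
    }

    public static void main(String[] args) {
        System.out.println(translate("buf", "0", "n"));
        // java.util.Arrays.fill(buf, 0, n, (byte) 0);
    }
}
```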

>
> Like the others, I don't want to pick on your approach, but am very  
> interested in seeing it from a different angle. Your success
> with your translators suggests that you have found something that  
> works, and seems to work quite well.
> Though I must add, I'm also worrying about implicit loops in the  
> rules. While they seem to be pretty simple algebraically, detecting
> infinite loops is a surprisingly hard problem. Proving them to be  
> well-formed and closed can be non-trivial to say the least.
>
> cheers,
>
> -k

Thanks for the input. I think your response to translating "Woods Eyes 
Masters" illustrates my point perfectly.
That is, for real NLP, treewalking solves nothing. It's simply one way 
to start approaching the problem,
and by the time you tackle all the hard issues, treewalking is just a 
tiny piece of the work, and it's not clear (to me)
that it's even adding anything. To me, saying "high-level language 
translation is mostly a matter of walking an AST"
is like saying "Getting from NYC to LA is mostly a matter of getting on 
a plane". Yeah, that may be true. But if
you're going to program a robot to get from NYC to LA, getting on the 
plane will be just one tiny piece. To the
point where it doesn't really matter whether the robot takes a plane or 
a car.

Andy :)


