[antlr-interest] New article on StringTemplates and Treewalkers

Andy Tripp atripp at jazillian.com
Wed Jan 11 11:08:43 PST 2006



Gregg Reynolds wrote:

> Andy Tripp wrote:
>
>> In a fit of reverse-writer's-block last night, I wrote down
>> some thoughts on AST treewalking and StringTemplate, titled
>> "Why I don't Use StringTemplate for Language translation"
>>
>> The article is here: http://www.jazillian.com/stringTemplate.html
>>
>
> Hi Andy,
>
> A few holes to poke in your article.  Which I mean in the nicest 
> possible way!
>
> From your paper:  "But the main rationale for separating the "view" 
> from the "controller" and "model" is so that we can have multiple 
> "views", and that we can easily change the "view" without having to 
> touch the "model" or the "controller. Certain applications may have 
> multiple "views" (ANTLR, for example, which takes a single input in 
> ANTLR-language, but generates Java code for Java programmers, C code 
> for C programmers, etc). But for other applications, such as a 
> "Any-dialect-of-C to Java" or "C or C++ to Java", the mapping is 
> many-to-one, not one-to-many."
>
> Isn't this a false dichotomy?  The same considerations apply to both 
> situations.  If antlr can do many-to-one (source grammar to a variety 
> of target languages) 

You mean "one-to-many", here, not "many-to-one", don't you? ANTLR itself 
has just one input language, and "many" output languages (C++, Java, C#).

> that is only because somebody took the trouble to write the target 
> generation code.  It's not one-to-many, but many one-to-ones.  This is 
> exactly what happens with a many-to-one mapping (variety of source 
> languages to one target language): for each source language somebody 
> has to take the trouble to write the transformation code, and you 
> again end up with many one-to-ones.

No, I don't think that ANTLR is many one-to-ones at all. There is only 
one input language, there is a lot of code to derive the output, and then
there are minor variations on the output to make it fit either C++, 
Java, or C# syntax.

>
> So if it is a problem for Antlr, it is the same problem for Jazillion 
> or any other code xformer, regardless of implementation technique.

I do agree (and I'm not sure if this is your point or not) that ANTLR 
and Jazillian seem like they should both be designed the same way.
That's why I wrote this - to try to understand why my "rule-based" way 
seems better to me, and yet the "treewalker" way seems better
to Terence and a lot of people in the translation world. I assume it's 
something I'm missing, but who knows?

>
>
> Actually I think "MVC" is probably not the best idiom for discussion 
> parsing and transformation, coming as it does from the world of 
> graphical representation of data.  (Personally I don't find it useful 
> to think of the result of a translation as a "view" of the source; 
> e.g. calling the parser code generated by Antlr a "view" of the source 
> grammar doesn't work for me.  

Me neither, I hope I didn't say that.

> Nobody considers the machine code emitted by a compiler to be a "view" 
> of the source code.)

Ah, but they do. I do, and that's exactly what Terence is saying in the 
StringTemplate article...that the target Java, Python, and bytecode
are simply three slightly different "views" of the output. I agree with 
that.

>
> The real question is not separation of m v and c, but of the 
> *genericity* (adaptability, flexibility, whatever) of the "service": 
> given a parser generator, is its backend architecture general enough 
> to make it easy to write specialized emitters?  Given a language 
> transformer (e.g. Jazillion), is its frontend architecture general 
> enough to make it easy to specialize it for a variety of input languages?

In my case, I haven't cared too much (yet) that the frontend be able to 
handle multiple input languages (or that the backend be able
to output multiple languages for that matter). Just a single C-to-Java 
translator is hard enough, and I've been happy to spend 3 years full time
thinking about all the ways to do that really well, rather than 
expanding my scope. Having said that, I'm now working on C++ to Java, 
though :)

>
> More specifically:  how hard would it be to write an ML or Haskell 
> emitter for Antlr (something I'd like to see)?

Good question, and my related question is "will StringTemplate make that 
any easier?".

>
>
> How hard would it be to write an ML or Haskell front-end for 
> Jazillion?  (I mean relative to a C frontend, not relative to a 
> backend to Antlr, which would no doubt be easier.)

Answer: very hard: the translation rules are all C-specific. To put it 
bluntly, the Jazillian "front-end" is not in any way separated from the 
"engine"
and "backend". I believe it's impossible to design such an 
any-language-to-any-language translation engine, despite the fact that
Semantic Designs claims to have such a product.

>
> (Note GCC is a good example of genericity both on the front and back 
> ends.)

Right, I'm familiar with the gcc 4.0 architecture. IIRC it only supports 
C/C++ with gcc-specific extensions and Java as input,
and executable and Java bytecode as output. Good luck getting it to 
input or output ML, Haskell, or Lisp :)

>
> A general observation:  you contrast the Antlr (AST) approach to 
> "pattern-matching" in a few places (e.g. "is what you've got using 
> StringTemplates and AST walking better than what you'd have with some 
> (unspecified here) pattern-matching approach?"
>
> But parsing *is* pattern matching, no?  So it isn't clear (to me) what 
> exact contrast you're trying to establish.

I'm not referring to ANTLR parsing here, but ANTLR treewalking. But yes, 
we could consider treewalking to be "pattern-matching on
two-dimensional trees", while I'm saying I prefer "pattern-matching on 
one-dimensional token streams", simply because it's trivial to
form mental pictures of token streams. When we read "int[] i;", our 
brain has already tokenized it into a sequence of 5 tokens:
int [ ] i ;
But given that same chunk of code, our brains do NOT easily form an AST 
structure:
VAR_DEC
     TYPE "int"
     ARRAY_DEC  "[]"
     NAME "i"

Avoiding mental pictures of AST trees altogether is just a HUGE 
productivity boost, at least for me. I'd say I'm at least twice as 
productive in writing rules (both simple text-replacement ones and 
complex ones written in Java code), and probably more like 5-10x more 
productive, by largely ignoring AST structures.
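To make the contrast concrete, here is a toy sketch of a rule applied to a 
one-dimensional token stream, in the flavor described above. Everything here 
(the TokenRule class, its names, the NULL-to-null rule) is my own hypothetical 
illustration, not Jazillian's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// A toy rule-based rewriter over a flat token stream: a rule is just a
// literal token pattern plus a replacement, applied left to right.
// All names here are hypothetical, for illustration only.
class TokenRule {
    final List<String> pattern;      // token texts to match in sequence
    final List<String> replacement;  // token texts to substitute

    TokenRule(List<String> pattern, List<String> replacement) {
        this.pattern = pattern;
        this.replacement = replacement;
    }

    // Rewrite every non-overlapping occurrence of the pattern.
    List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + pattern.size() <= tokens.size()
                    && tokens.subList(i, i + pattern.size()).equals(pattern)) {
                out.addAll(replacement);   // matched: emit the replacement
                i += pattern.size();
            } else {
                out.add(tokens.get(i++));  // no match: copy the token through
            }
        }
        return out;
    }
}
```

For example, a rule mapping C's `NULL` to Java's `null` would turn the stream 
`x = NULL ;` into `x = null ;` - no tree structure needed to state or apply it.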

>
>
> One of the examples you give to illustrate the difficulty of AST-walking:
>
>     2.  At any "printf function" node, loop through the format string 
> and arguments, and do lots of processing to replace them with Java 
> using the "+" operator.
>
> My understanding is that you would just write a production for the 
> grammar of the args of the printf function, which you could take 
> directly from the C grammar, augmented by info from the printf 
> definition in the library.  The "lots of processing" must occur 
> regardless of implementation strategy, but in Antlr the grammar 
> recognition part (looping through the format string and args) is clear 
> and simple(?).

That's right, except I disagree about the "clear and simple". It's not 
all that hard, but it's also not trivial. I prefer to just use plain old 
Java.
To parse arguments, for example, I have a Java method that loops through 
a token list, looking for commas and doing a little
bit of logic with matching parens/brackets/braces:

List<List<Token>> parseArgs(List<Token> tokens);

Each argument is returned as a List of Tokens, and so all arguments are 
just a List of Lists of Tokens.
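A minimal sketch of what such a method might look like (the Token class and 
all names here are my assumptions for illustration; this is not Jazillian's 
actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal Token type; a real one would carry type, position, etc.
class Token {
    final String text;
    Token(String text) { this.text = text; }
}

class ArgSplitter {
    // Split an argument token list on commas, treating commas nested
    // inside (), [], or {} as part of the current argument.
    static List<List<Token>> parseArgs(List<Token> tokens) {
        List<List<Token>> args = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        int depth = 0;  // nesting depth of parens/brackets/braces
        for (Token t : tokens) {
            switch (t.text) {
                case "(": case "[": case "{":
                    depth++; current.add(t); break;
                case ")": case "]": case "}":
                    depth--; current.add(t); break;
                case ",":
                    if (depth == 0) {            // top-level comma: end argument
                        args.add(current);
                        current = new ArrayList<>();
                    } else {
                        current.add(t);          // nested comma: keep it
                    }
                    break;
                default:
                    current.add(t);
            }
        }
        if (!current.isEmpty()) args.add(current);
        return args;
    }
}
```

So the token stream for `g(a, b), c` splits into two arguments: the six tokens 
of `g(a, b)` and the single token `c`.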
To someone who knows ANTLR well, this approach may not seem any easier, 
but it is to me.
Then again, I hate it when I see CSS syntax inside HTML, so maybe it's 
just me :)

>
> Correct me if I'm wrong, but I get the impression you're thinking 
> about writing by hand a bunch of the AST parsing logic that Antlr 
> generates automatically for tree grammars, rather the way you might 
> need to proceed if you were using a less sophisticated parser 
> generator (lex/yacc, etc.)

No, no. I use ANTLR for all lexing and parsing; I would never do that by 
hand. It's just that I'm doing translation from token-stream
to token-stream (generally avoiding parsing) rather than from AST-to-AST.

> In that case, yes, it would definitely be a pain because you might 
> need to do it all by hand.  But if I understand Antlr correctly, it 
> saves you the trouble by supporting tree grammar.  So the interesting 
> contrast is not necessarily between your approach and Antlr's, but 
> between Antlr v. other parser generators.

One key difference with ANTLR vs. Jazillian is that Terence gets to 
design the ANTLR input grammar, but I didn't get
to design the C input grammar. As a result, the ANTLR input grammar is 
well designed, and designed to do everything that ANTLR needs it to do. 
By contrast, the C grammar isn't quite so great. For example, I want to 
handle preprocessor stuff (#include, #define, etc.) with the same 
approach that I use for everything else. And yet the C language grammar 
knows nothing at all about preprocessor directives.

>
> All for now.  I'm not sure I agree with your paper, but it has 
> certainly provoked thought.
>
> -gregg

Thanks for the input.
Andy

