[antlr-interest] Re: Translators Should Use Tree Grammars

micheal_jor open.zone at virgin.net
Mon Nov 22 22:50:53 PST 2004



--- In antlr-interest at yahoogroups.com, "atripp54321" <atripp at c...> wrote:

> My feeling is that this analysis code dwarfs the symbol-table
> code in my app. In other words, the symbol table provides a set of
> methods which is just a small subset of a larger library.

A subset of a larger system, yes. The lexer and parser (part of the
[syntax] analysis phase), for instance, would be larger in most
language processing projects.

> > The AST searching can be done by specifying patterns to match in your
> > tree-grammar. AST transformations can be similarly accomplished.
> 
> Right. The whole big question here is whether the ANTLR
> (or any other) AST matching-and-replacement syntax can do
> all sorts of complicated matching and transformations that
> happen in real-world translation projects, AND whether it's
> easier to specify in ANTLR than in Java (or C++, or whatever).
> 
> One example comes to mind: I replace printf() with System.out.println
> if the last two characters in the first argument are "\n".
> In Java, that's just "if (arg1AST.getText().endsWith("\n")) {"
> Now I'm sure there's no "ANTLR-syntax" for that sort of thing,
> and the solution is "just embed Java in that case".

- Specify a pattern to match FUNCTION_CALL subtrees or whatever
they're called.
- Embed action code that performs the required semantic checks - e.g.
check that the callee is "printf" and that its first argument ends in
"\n" (a sketch of such a check follows this list).
- Depending on the checks, perform the transformation using ANTLR
syntax, embedded action code or a mixture (you might need to update
auxiliary structures such as the symbol table in addition to the AST
changes).
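
To make the second step concrete, here's roughly the kind of check that
action code might contain, written against the antlr.collections.AST
interface your own snippets use. The FUNCTION_CALL layout I'm assuming
here (name node first, arguments as its following siblings) is a guess;
the real cgram.g tree may differ:

import antlr.collections.AST;

public class PrintfCheck {
    // Returns true if the matched FUNCTION_CALL is a printf(...) whose
    // first argument ends in "\n", as in the example above. Whether the
    // token text holds a real newline or the two-character escape
    // depends on the lexer, so adjust the endsWith() argument to match.
    public static boolean isSimplePrintf(AST call) {
        AST callee = call.getFirstChild();       // function name node (assumed layout)
        if (callee == null || !"printf".equals(callee.getText())) {
            return false;
        }
        AST firstArg = callee.getNextSibling();  // first argument (assumed layout)
        return firstArg != null && firstArg.getText().endsWith("\n");
    }
}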

> But my problem
> is that I think that almost *all* the cases are like that, so I
> have to embed *lots* of Java, and I'd have to end up doing what
> it appears was done with that ASPA tool...separate out all that
> Java from the grammar to keep my sanity. And that's kinda ugly
> I think.

I don't think Anakreaon's reason for using the annotation tool was to
keep his sanity. Perhaps he just wanted the tree grammar specified in
one place. The structural differences between a C and a Java AST might
have precluded the use of the tool had he been working on a C-to-Java
translator.

> But don't you want the searching and the transformation to be
> "tightly coupled"? For example, I have hundreds of straightforward
> rules that can just be specified with simple text like this:
> 
> int main(int v1, char *v2[]) --> public static void main(String[] args)
> 
> So the "matching pattern" and the "replacement pattern" are only
> separated by "->". I'd hate to have rules where the "matching"
> is specified in one place and the "replacement"
> is in another. Even worse, the "matching" and "replacement" would
> be in different languages. Not good.

ANTLR syntax can specify both matching and transformation. As can Java.
Or even a custom DSL (as you have presumably implemented above). One
option saves more time than the others.

> > Most production-level ANTLR grammars have a mix of ANTLR directives
> > and action code. Incidentally, I don't recommend using a tool without
> > understanding it.
> 
> You don't recommend using a library, for example, where you
> understand the thing that it returns, but now how it works?
> For example, you wouldn't use a JPEG encoder library that
> produces a .jpg image without understanding JPEG or the
> algorithm that it used?

Not without understanding what JPEG is and that the library encodes
JPEG, plus what input it requires and what the output would be. Then I
might try it on a few sample tasks. I might really have been looking
for an MPEG encoding library after all  ;-)

> I say "new JButton("hello");" and use the JButton class without
> any idea how it does its drawing. I really think it's ok
> to use ANTLR to parse an C source into an AST without understanding
> cgram.g and the ANTLR syntax that it follows. 
> (That's not to say I don't).

Apples and oranges. Your JButton analogy is more like "I can use ANTLR
without needing to fully understand the contents of Ter's PhD thesis."
You can only use JButton with an understanding of Java: import syntax
and semantics, classpaths, object construction.....

> > Symbol tables are usually built for fast querying of node attributes
> > only and are often not updatable (there is no need since source files
> > don't change during compilation). 
> 
> AHA! Yes! And there you have it! The ASTs DO change during
> translation, they change dramatically and often...that's
> what translation is.
> 
> > You can make them updatable if your
> > applications need it.
> 
> And wouldn't every translator have
> a constantly changing symbol table?

Not necessarily. A Pascal-to-Modula2 or GWBASIC-to-VB translator
might not need an updatable symbol table. I'm just guessing here, mind you.
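
The distinction is small in code terms anyway. A rough sketch, with all
the names invented purely for illustration:

import java.util.HashMap;
import java.util.Map;

// A query-only symbol table is enough for a compiler-style analysis pass.
class Symbol {
    String name;
    String type;
}

class SymbolTable {
    protected final Map<String, Symbol> symbols = new HashMap<>();

    Symbol lookup(String name) {
        return symbols.get(name);
    }
}

// A translator that renames variables, changes their types or deletes
// declarations would also want update operations like these.
class UpdatableSymbolTable extends SymbolTable {
    void define(String name, Symbol sym) { symbols.put(name, sym); }

    void remove(String name) { symbols.remove(name); }

    void rename(String oldName, String newName) {
        Symbol sym = symbols.remove(oldName);
        if (sym != null) {
            sym.name = newName;
            symbols.put(newName, sym);
        }
    }
}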

> > > > > > 4. Comparing approaches by analyzing ease of change
> > > > > > ===================================================
> > 
> > > Right, but the whole point of the tree grammar is to minimize
> > > the amount of code that you have to write. What's the point
> > > of embedding 30,000 lines of code inside a 350 line grammar,
> > > if you could have just written 30,020 lines that do the same
> > > thing?
> > 
> > A little exagerated no doubt. 
> 
> How do you mean? I do have about 30,000 lines of translation
> code, and the "bare - do nothing" C and java treewalker grammars
> are several hundred lines. And it wouldn't take more than
> 20 lines to do an inorder tree traversal.

It would take rather more if you preserved context info as the
tree-grammar approach does.

As for the 30,000 lines, I wonder how much of that would be generated
in an equivalent "tree-grammar" project.
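
For reference, the "bare, do nothing" depth-first walk really is short.
Something like this against the antlr.collections.AST interface (the
class and its names are mine, just for illustration):

import antlr.collections.AST;

public class BareTreeWalk {
    // Visit every node depth-first via first-child/next-sibling links.
    // Note what's missing: the walk sees only node types and text, not
    // whether an ID is a declaration, a parameter or a call site. That
    // grammatical context is exactly what a tree-grammar rule preserves.
    public static void walk(AST node) {
        for (AST n = node; n != null; n = n.getNextSibling()) {
            System.out.println(n.getType() + ": " + n.getText());
            walk(n.getFirstChild());
        }
    }
}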

> > Most people are not using ANTLR to build their systems. If they were,
> > they'd [better] know ANTLR.
> 
> So most people won't want to use ANTLR to do AST matching
> and transformations, even if they do use it (casually)
> for lexing and parsing. i.e.

Don't know. I do know that most people aren't building language
processors explicitly and therefore have no need to use ANTLR or other
similar tools (including "tree libraries").

> By analogy, Swing is inherently easier than SWT because it's
> "just Java", the developer doesn't have to learn a whole
> new framework. 

To a Java developer, SWT is just Java too. Am I missing something, or
are you saying Swing is easier because it's from Sun and/or is included
in the standard Java distributions?

> > The empty expression?. Well, writing a rule to differentiate this case
> > from the other seems easy enough - check for the EXPR's children.
> 
> Yes, it's easy enough in Java, just say:
> AST expr = ast.getFirstChild();
> if (expr == null) {
> 
> ...but how do you do that using the treewalker grammar?
> Maybe you can, but it's one more thing that someone's
> got to learn that they wouldn't have to learn if they didn't
> use a treewalker.

You weren't born with the ability and knowledge to write that Java
code snippet, were you?  ;-)

> > It can be a lot more than an AST traversal. It depends on what I put
> > into my grammar. See my earlier comments.
> 
> Can be, but is it typically a lot more for translators that
> convert one high-level-language to another?

I'd suggest that depends on the languages involved.

> > Both the input and the output of a tree parser (tree-parser !=
> > tree-walker) is an AST. It is an AST-to-AST transformer. In the
> > simplest case (the tree walker) it simply visits each node of the AST
> > in depth-first order. The input and output ASTs are structurally the
> same.
> > 
> > In more complex cases, the AST would be transformed and the input and
> > output ASTs would be structurally different.
> 
> Yes I know. I'm just not getting my point across :(
> 
> How about this...let's say you're picking apples from a real
> tree.

Oh come on!

> You put good apples into the basket and bad ones in
> the garbage can. One way to pick the apples is with a tree
> traversal...climb the trunk, go out on the first branch,
> look at the first apple, etc.
> 
> But is that the most natural way to pick the apples?
> Maybe not. You could just shake the tree and then sort
> out the apples after they've fallen.

Sorry, don't like my apples overly bruised  ;-)

> > Not at all. I might be able to implement my system with the string
> > hangling sheenanigans of sed/awk/perl etc. The problem isn't ever
> > "lexing" or "parsing". It's building a source-to-source translator or,
> > computing source code metrics etc.
> 
> Look, I said that lexers and parsers are candidates for
> BNF-like specification because it's clear what the code is that
> they should produce (it could be cryptic like lex/yacc or readable
> like ANTLR, but the functionality of the generated code for
> a lexer or parser is clear to all).

Clear to you; that is what you mean, right? And, by implication, tree
parsers aren't. I suggested you should learn more about them. Perhaps
do a simple source-to-source project to get a feel for them.

<SNIP>

> Show me a C token stream and I'll show you the AST that it
> should produce. But show me a C AST and try to figure out
> what the "equivalent" Java AST should be. Ask 100 people
> and you'll get 100 different outputs.

Didn't we discuss an issue with the C AST structure for
ARRAY_DECLARATOR? It would seem AST structure isn't universally
uniform after all.

> Yes, you don't need to use lexers and parsers, but that's
> not the point. The point is that there's an inherent
> difference between {lexers, parsers, compilers} and
> {translators}.

Compilers are translators. Lexers, parsers and treeparsers are
components of translators.

> > You might want consider separating analysis phases (that gather the
> > info you need to make your synthesis decisions) from synthesis phases
> > (that perform the actual transformations).
> 
> As I've said, too late for that, I'm done already :)
> The problem is that when you've got hundreds of translation
> rules, there's no point in doing an "analysis phase" ahead of time.
> The analysis is (potentially) wrong after each rule firing.
> 
> Say we have variable of type 'long'. I have lots of rules that
> might change that type to an 'int', rename it, move the declaration,
> delete it altogether, etc.

Nevertheless, translators are [almost] always written to operate in
such phases, with synthesis following analysis.
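
Schematically the split looks something like the sketch below (all
names are mine, purely for illustration). A real translator would
re-run the analysis whenever a transformation invalidates earlier
facts, which is where your 'long' example bites:

import antlr.collections.AST;
import java.util.HashMap;
import java.util.Map;

public class TwoPhaseTranslator {

    // Phase 1: walk the tree and gather facts without changing it,
    // e.g. record declared types so later rules can decide whether a
    // 'long' can safely become an 'int'.
    static void analyze(AST node, Map<String, String> declaredTypes) {
        for (AST n = node; n != null; n = n.getNextSibling()) {
            // record facts about n here, e.g. declaredTypes.put(name, type)
            analyze(n.getFirstChild(), declaredTypes);
        }
    }

    // Phase 2: walk again, consulting the gathered facts to decide
    // which AST transformations to apply.
    static void synthesize(AST node, Map<String, String> declaredTypes) {
        for (AST n = node; n != null; n = n.getNextSibling()) {
            // rewrite, rename or delete nodes here, guided by declaredTypes
            synthesize(n.getFirstChild(), declaredTypes);
        }
    }

    public static void translate(AST cTree) {
        Map<String, String> facts = new HashMap<>();
        analyze(cTree, facts);      // analysis
        synthesize(cTree, facts);   // synthesis
    }
}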

> No, I can't use Java without understanding Java syntax :)
> But I can use ANTLR to lex and parse
> without understanding the grammar syntax.

Only because someone else has done all the work. Same as using a Java
library without having to write the code. What if there were no
ready-made C grammar? Or if (like the C++ grammar) it didn't have
AST-building directives?

> > I don't know of any open source example. Does Ter's multi-pass
> > translators and Monty's AREV Basic work count?.
> 
> Maybe...anyone got a link for those?

The ANTLR site has some info on both projects, as does Monty's site.

> > Even lint/splint etc?. 
> >Compaq's ESC?. You *are* performing static
> > semantic analysis and using the result to informs decisions about
> > which AST transformations to perform.
> 
> I hadn't looked at lint, assuming that it's not doing anything
> radically different than javac (which I did look at).
> I hadn't looked at splint or ESC. I'll take a look at them.
> Keep in mind that these are just checking and not
> translating.

Translating is checking followed by transforming followed by checking
then transforming then....

Cheers,

Micheal
ANTLR/C#






 