[antlr-interest] postmortem

Thu Mar 13 08:00:40 PDT 2008

Thomas Brandon wrote:
> On Thu, Mar 13, 2008 at 7:58 AM, Andy Tripp <antlr at jazillian.com> wrote:
>   
>>  Jim Idle wrote:
>>     
>>> I think you miss the point. We can't 'know' that they did or didn't want a
>>> flat tree. Who is this someone that you are designating tasks like this to?
>>>       
>> Well, we can't "know' anything about what anyone wants, in general. The best
>>  we can do is make a best guess. And I think the best guess is that most
>> ANTLR users want a non-flat AST.
>>     
> Yes, most ANTLR users want a non-flat tree. But most (if not all) of
> these users do not want a parse tree. 
Agreed. But keep in mind that for many applications, simply walking the 
AST by hand is enough.
> The parse trees generated by
> ANTLR are not just like ASTs and cannot be used as such. Parse trees
> consist of standard AST nodes for all the actual language nodes and
> special parse tree nodes (of type ParseTree extending CommonTree) for
> the rule references. These parse tree nodes have a token type of 0.
> Thus you cannot use a tree parser against a parse tree and manually
> walking the tree would be complicated. Parse trees (as generated by
> ANTLR) are more of a debugging aid than a type of AST that
> automatically adds structure.
>   
...I think that in many cases processing the parse tree would be very 
simple:

if (!(ast instanceof ParseTree)) {
   doSomethingWith(ast);
} else {
   // ignore ParseTree nodes
}

> Perhaps you could add an output=CST (Concrete Syntax Tree, aka parse
> tree) option. But how would this work?
> Given a rule like:
> myop: modifier MYOP contents;
> what should our proposed CST constructor generate? What token types
> should rule references automatically generate? You could generate a
> token type named after the rule but producing ^(MYOP attributes MYOP
> contents) where the first MYOP is our auto-generated one mapping to no
> part of the input and the second is an actual token seems bizarre and
> likely to cause troubles. Maybe we could do ^(RULE_MYOP modifier MYOP
> contents) but do any of the users who don't want flat ASTs want that?
>   
No. Instead, how about saying "if there is exactly one terminal (i.e. 
lexer token or literal) in the rule,
put a ^ after that:

(attributes MYOP^ contents)
> And do they also want:
> modifier: PUBLIC | PROTECTED;
> to generate ^(RULE_MODIFIER PUBLIC)?
>   
Given that definition of modifiers, I'd want:

modifier: PUBLIC^ | PROTECTED^;

More realistically, they'll also have:
modifiers: modifier*;

...where I'd additionally want:
modifiers: ^(RULE_MODIFIERS modifier*)


> I'd imagine what they really want is not a CST but the AST ^(MYOP
> modifier content). With modifier having no dummy parent and contents
> having one.
>   
Right. So maybe these heuristics work:
* put a ^ after every literal.
* if we see more than one literal in a sequence (e.g. MYOP WHATEVER 
attributes contents), only the first one gets a ^
* a rule gets wrapped with ^(RULE_XXX ...) iff any of its alternatives 
is a non-literal

In that way, we get what we want:
modifier: PUBLIC^ | PROTECTED^;
myop: modifier MYOP^ contents;


> OK, so we don't want output=CST we want to auto-generate ASTs. But
> how? Given the above case we might think we could have a rule that if
> there's one token reference and other rule references we make the
> token the root. That's easy but what if we've got:
> method: keywords ID args catch;
> we probably don't want ^(ID keywords args catch) as that's very hard
> for our tree walker to distinguish from:
> field: keywords ID init;
> which makes ^(ID keywords init).
>   
Right, that's a problem. I guess I'd suggest making an AST node for 
*every* rule:
modifier: ^(RULE_MODIFIER (PUBLIC^ | PROTECTED^));
myop: ^(RULE_MYOP (modifier MYOP^ contents));

...but I guess maybe that puts us where we started - generating a full 
parse tree.
But even without solving this problem - go ahead and leave the tree as 
being hard to walk -
we're still ahead. The goal here is not to generate a "good" AST, but 
rather just produce something
that's better than nothing (i.e. a flat tree). So the newbie gets this 
tree, sees that it's hard to
distinguish the various cases where he's got an ID node, and starts 
reading up on how to build
an AST. He's better off here than the alternative: a blank stare at a 
flat ast.
> And what do we do with:
> method: keywords ID LPAREN args RPAREN CATCH catch;
> Here I'd probably want the AST ^(METHOD_CALL ID keywords args catch)
> but how can a tool know that.
>   
Here again, I think the "every rule gets its own AST" rule covers it. 
Yes, I suppose that
makes it a parse tree.
> OK, so we want to have some default rules and some syntax to disable
> automatic generation. 
Just use what's already there: if there are zero ^ characters, return a 
parse tree.
> But how often is this auto generation actually
> going to be used? I think you're very often going to want to disable
> any such automatic generation. OK, so any use of AST rebuild operators
> disables the automatic generation. 
Right.
> But what about your "modifiers:
> PUBLIC | PROTECTED;" rule? Adding "options { autoAST=false; }" to all
> such rules is going to be pretty annoying
No. No need for per-rule options, or even any new global options.
>  and "modifiers!: PUBLIC |
> PROTECTED;" isn't going to be very understandable for new users.
>   
Right, no need for that either.
> I think if you spend a bit of time actually thinking about how you'd
> manage to implement what you want you'll see it really doesn't work.
>   
Thanks for writing this up. Let me know if you think the replies I gave 
would work.
As I said, if they work, but only make things somewhat better, but not a 
completely usable AST, that's
still a win IMO. If they don't work, then how about simply say "if there 
are no ^'s, I'll just use
the existing code to get a parse tree and return that"?

Andy
> Tom.
>
>   

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20080313/f8474cd9/attachment.html