[antlr-interest] Re: AST factory / heterogeneous tree enhancement

Mon Oct 21 17:19:56 PDT 2002

Hi again,

Couldn't stay away so here's a more detailed reply.

--- In antlr-interest at y..., Terence Parr <parrt at j...> wrote:
> Ok, Loring and I have discussed the tree factory problems.  
> "micheal_jor" <open.zone at v...> brought them up regarding C# and 
> Ric seems to have fixed this for C++.  So, now the Java solution.
> 
> Here is the problem as I understand it.
> 
> 1. #[FOO] always builds an AST node of the default type because
>      the ASTFactory only knows about the default.

This is an accurate statement of the problem with AST construction 
with the Java codegen and [formerly] the C# codegen.

There is also a related issue when the nodetype is specified by 
annotating a token reference in a grammar:
      aRule
         : TOK1<AST=CustomNode.Tok1Node> TOK1<AST=CustomNode.Tok2Node>
         ;

One additional issue that I would like to introduce relates to token 
redefinition. How does one specify a custom ASTNodeType globally for 
terminals such as ID and PLUS that aren't originally defined in a 
tokens {..} section?

We have tokens defined in the lexer (and therefore without 
ASTNodeTypes) that were to be "importVocab'd" into the parser 
(parsers actually). We planned to add the ASTNodeTypes (which could 
be different for different parsers) in the tokens section in the 
parser. Can we use the tokens {...} construct to do this with 
terminals like ID and PLUS? Being forced to use per-TokenRef options 
is very wasteful/verbose since the node type will be the same for all 
IDs.

> In future if you say
> 
>      tokens {
>          PLUS<AST=PLUSNode>;
>          ...
>      }
> 
>      then I'll make action #[PLUS] create the right node.  You can
> also say
>      #[ID,"foo","VarNode"] (3rd arg is the type of node to create).

I presume you meant that _both_ #[PLUS] and #[PLUS, "sometext"] will 
be fixed. 

I kinda like the extended syntax - I view it (and the per-tokenRef 
option) as a sort of local override of the global 
TokenType==>ASTNodeType mapping established with setASTNodeClass and 
tokens {...}. 

In our [informal] ANTLR coding standards, using "local override" 
ASTNodeType constructs is the exception rather than the rule.

> 2. dup methods of ASTFactory don't respect the type of the nodes; it
>      uses default node type.  In future, i'll use 
> t.getClass().newInstance()
>      to do the dup.
> 

The dup() methods ultimately call the factory's create() method. Once 
the factory is able to create the right nodes based on its type, the 
dup() methods should just work. At least that was the experience with 
C#.
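
To illustrate the point about dup(), here is a minimal Java sketch. 
Every name in it (SketchNode, PlusNode, SketchFactory, setNodeType) 
is a hypothetical stand-in, not the ANTLR API; it only assumes that 
dup() delegates to a create() that consults a TokenType==>node class 
table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT ANTLR classes.
class SketchNode {
    int type;
    String text;

    void initialize(int type, String text) {
        this.type = type;
        this.text = text;
    }
}

class PlusNode extends SketchNode {}

class SketchFactory {
    // Global TokenType ==> node class table (assumed design, not ANTLR's).
    private final Map<Integer, Class<? extends SketchNode>> nodeTypes = new HashMap<>();

    void setNodeType(int tokenType, Class<? extends SketchNode> nodeClass) {
        nodeTypes.put(tokenType, nodeClass);
    }

    // Creates a node of the class registered for tokenType,
    // falling back to the default SketchNode when nothing is registered.
    SketchNode create(int tokenType, String text) {
        Class<? extends SketchNode> c = nodeTypes.getOrDefault(tokenType, SketchNode.class);
        try {
            SketchNode n = c.getDeclaredConstructor().newInstance();
            n.initialize(tokenType, text);
            return n;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + c.getName(), e);
        }
    }

    // dup() simply delegates to create(), so it picks up the table for free.
    SketchNode dup(SketchNode t) {
        return t == null ? null : create(t.type, t.text);
    }
}
```

With a table entry registered for PLUS, dup() of a PlusNode yields 
another PlusNode without any dup-specific logic.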

> 3. hetero tree construction does not call the factory.  E.g.,
> 
>      anIntRule : INT<AST=INTNode> ;
> 
>      generates
> 
>      INTNode v = new INTNode(LT(1));
> 
>      but we need to instead generate:
> 
>      AST v = (AST)astFactory.create(LT(1),"INTNode");
> 
>      where the create(...) method is new and specifies the type to
>      create.  This will use newInstance() instead of "new" by 
>      default.
> 

This contradicts the "Heterogeneous AST" section of the reference 
manual which states that "ANTLR uses the factory to create nodes for 
which it does not know the specific type". 

My opinion is that ANTLR should always use the ASTFactory except for 
(1) the new extended AST construction syntax and (2) the per-tokenRef 
ASTNodeType option since they effectively "override" the factory's 
global view of Token==>ASTNodeType mappings specified with the 
setASTNodeClass and the "tokens {...}" options.

I can't actually remember what policy has been (or is to be) 
implemented in the C# codegen, but I remember that the pre-existing 
mechanism for reading the grammar file and loading the various 
options removed the distinction between per-Token and per-TokenRef 
ASTNodeType settings for grammar atoms. The GrammarAtom simply has an 
ASTNodeType attribute. 

So I guess for "all non-manual tree construction requests that 
involve per-token or per-tokenref ASTNodeType options" the C# codegen 
will (must?) always generate
   INTNode v = new INTNode(LT(1));

The specified ASTNodeType may not be the type associated with the 
TokenType in the ASTFactory's mapping table so it's safer to just 
bypass the factory entirely.
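
The mismatch can be sketched in Java as follows. All names here 
(DemoNode, IntLeaf, HexLeaf, DemoFactory, OverrideDemo) are 
hypothetical stand-ins rather than ANTLR classes, and the token type 
value is made up. The sketch assumes a factory keyed only by token 
type and a per-reference option that asks for a different node class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only; these are NOT ANTLR classes.
class DemoNode {
    String text;
}

class IntLeaf extends DemoNode {}

class HexLeaf extends DemoNode {}

class DemoFactory {
    // The factory's global view: one node class per token type.
    private final Map<Integer, Supplier<DemoNode>> byTokenType = new HashMap<>();

    void register(int tokenType, Supplier<DemoNode> maker) {
        byTokenType.put(tokenType, maker);
    }

    DemoNode create(int tokenType, String text) {
        Supplier<DemoNode> maker = byTokenType.getOrDefault(tokenType, DemoNode::new);
        DemoNode n = maker.get();
        n.text = text;
        return n;
    }
}

class OverrideDemo {
    static final int INT = 4; // made-up token type value

    // Going through the factory yields whatever the table says for INT.
    static DemoNode viaFactory(DemoFactory f, String text) {
        return f.create(INT, text);
    }

    // A per-reference override such as INT<AST=HexLeaf> bypasses the table.
    static DemoNode viaOverride(String text) {
        HexLeaf n = new HexLeaf();
        n.text = text;
        return n;
    }
}
```

If the table maps INT to IntLeaf, viaFactory() can only ever return an 
IntLeaf, while viaOverride() returns the HexLeaf that the reference 
asked for.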

Incidentally, the same will be true of the extended manual tree 
construction syntax. 

> 4. If you define ID<AST=T> in tokens section then all code in
>      grammar "id:ID" should
>      define labels as "T id" not "AST id" nor labelASTType id.

Hmmm. Interesting. I don't think either the C++ or the C# codegen 
does this. What would be the benefit?

Cheers,

Micheal
