[antlr-interest] What's coming for 2.8 (was Re: Build Question)

Tue Mar 12 22:37:03 PST 2002

In response to Ric's and Mike Tiller's comments, I thought that I'd 
post something on the AST construction extensions and tree grammar 
generation.  This stuff all works now, and I'm in the process of 
jumping through the necessary hoops so that I can release the code.  
If we get really lucky, Ter will manage to finish his code generation 
stuff in time for a 2.8 release.

C++/C# support will require under 100 lines of new code to support the 
extensions, excluding the translocation support which takes about 500 
LOC.  That enables language neutral AST rewriting.

My current code base is based on 2.7.1; Ter and I plan to upgrade to a 
2.7.2 base before releasing 2.8.

More detail:
As Ter noted a few months back, I have been working on a tree grammar 
generation phase for ANTLR.  Since that is nearing completion, now 
seems a good time for a status report and request input.

Tree grammar generation involves a transformation of loop 
structures--()+ and ()*--containing root elements (elements marked 
with a caret) into recursive rules.  Additionally, alternatives need 
to be split out and subtree structure must be tracked.

The tree rewriting required to transform an input grammar tree into an 
output grammar tree is rather demanding on ANTLR's tree rewrite 
support. In the course of development, I have had to fix a number of 
misfeatures.  BANG (!) did not work when applied to parenthesized 
expressions, including ()+ and ()* and ()? blocks.  Annotations--ˆ and 
!--did not behave quite as expected in tree grammars.
	#( ROOT! A B )
led to the output tree
	#( A B )
rather than the A B list that might have been expected.  Further, ^ 
had scope limited to the current tree structure within a rule rather 
than having rule-level scope.
	#( A #( B Cˆ D ) )
would generate
	#( A #( C #( B D ) ) )
and not
	#( C #( A #( B D ) ) )
These have been fixed.

Additionally, roots are limited to single values in 2.7--ranges and 
alternatives are not supported.  This may get fixed this in time for 
the 2.8.0 release.

ANTLR tree restructuring via annotations is powerful, but spartan.  I 
found that I needed a facility for reordering AST nodes and explicitly 
respecifying trees within rule definitions.  This led to the idea of 
extending the ANTLR meta-language to include generic restructuring.  
The added facilities would include support for long distance 
transposition of tree elements--while not needed for the current 
application, it is essential in other contexts.

The syntactic approach that I settled on was to mirror the normal tree 
grammar--#( ... )--and add extensions as needed.  #{ ... } is a "build 
tree" construct (identifiers and tree expressions within #{ ... } 
define an AST construction rather than grammar elements to be matched) 
and is intended to be mnemonic, with the braces suggestive of an 
action syntax.  Tree construction may be conditional; { ... }? 
delineates construction predicates when building trees--the analogy to 
semantic predicates should be clear.  Additionally, "COPY" (duplicate 
and insert duplicate into tree and "REUSE" (insert "as is"--usually 
used as an optimization to avoid the overhead of AST node allocation) 
are keywords which precede AST node descriptors.  "RETURN" is used to 
restart tree construction--a better name could probably be found, but 
the usual usage is #{ RETURN #( A B C ) } to specify the tree returned 
from a rule.

One feature that I have found useful in other contexts is to 
translocate (or duplicate) a subtree to another location.  To support 
this, an isolated string literal ("literal") marks an insertion point; 
assigning a value to a string literal ("lit" = #( A B C )) marks the 
data to be translocated.  Scoping is fairly simple:  if an insertion 
point is encountered before an initial assignment of data to that 
insertion point, then subsequent assignments will be to that insertion 
point until a new insertion point is encountered.  Similarly, if an 
assignment precedes the first insertion point for that string, then 
the insertion point is used to resolve insertion of the preceding 
assignments and subsequent assignments will be made to the next 
insertion point.  Insertions are always done in line:  there are no 
special notations for "insert as root" or "insert as child".   Any 
such rearrangements are made in subsequent AST transformation passes.

The simplest loop-to-recursion transformation for tree grammars is
	foo : ( A Bˆ C )+ ;
to
	foo : #( B foo1 C ) ;
	foo1 : #( B foo1 C A )
	         | A
		;
but they can quickly become complex when alternative roots or nested 
loops are involved.  The tree grammar generation code has been torture 
tested and seems to handle all possible cases, including support for 
the tree rewriting extensions (except for the translocation extension, 
which requires significant effort to support in generated tree 
grammars).

--Loring Craymer

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/