[antlr-interest] Re: trees with payloads??

Wed Nov 10 21:22:59 PST 2004

--- In antlr-interest at yahoogroups.com, "micheal_jor" <open.zone at v...>
wrote:
...
> I have my doubts about doubly-linked ASTs too. I don't understand
> however how you formed the opinion that arrays are memory hogs. I
> would have though they were otherwise. And, if performance is
> important, they [might] offer superior locality properties which makes
> for very efficient cache usage. I say might only because this also
> depends on the memory location of the objects in the array. 

Arrays are either part of AST memory, in which case they are of fixed
length and there is almost always wasted space, or they are allocated
as a separate object from the AST, in which case two pointers (down
and right) become one pointer and an object with pointers to a vtable,
a size field, and the array at a minimum.  This is offset by a savings
 for terminal (right == down == null) nodes, but that does not help
much with memory usage.  Arrays are memory hogs.

Locality is not a good argument.  The arrays would not be arrays of
ASTs, but of references to ASTs.  The ASTs themselves would be
scattered to much the same extent in the two cases, and the variable
length arrays would contribute to memory fragmentation.

...

> > One other note:  with payloads, there is no reason for Tokens to
> > differ from ASTs, except that an AST may have down and right fields
> > set.
...
> > Both ASTs and Tokens are then replaced by Carriers (probably with a
> > better name).  This would have a performance advantage for tree
> > construction in the Parser since the token Carriers can be linked to
> > form a syntax tree in the parser.
> 
> Hmm, I am more inclined to eliminate [the cost of] Token objects
> [almost] entirely. I say almost because a mechanism is still needed to
> transfer attributes like lexeme/line/col info to AST node objects.

Not that easy.  You could have a fixed lookahead array, although that
would not work if syn preds are present, containing token data that
was then copied into ASTs as needed.  That leads to a lot of
copying--not very helpful.

> The TokenStream interface in it's current form might not be the [best]
> way to go. Unless Token objects can be recycled efficiently.

Carriers can be very efficiently recycled--you can recycle carrier
nodes from a syntax tree to a Factory, which keeps them in a linked
list and then feeds them back on allocation requests.

> So how can lexical tokens be modelled as an integer value (or enum
> value) usually and yet parsers are able to extract additional info
> such as lexeme/line/col on-demand and, inexpensively?

Token management in lexers is an art.  You have to avoid copying data
as much as possible for performance reasons and minimize memory
allocation.  llk does a reasonably good job; I think that Chris Leung
took advantage of lessons learned from flex and other lexer
generators.  Performance paranoia helps.  Payload objects could help
here, too--you can keep text start and end pointers into a
memory-mapped text file instead of allocating and copying strings.

--Loring

> The concept of a Carrier objects could play a part but I don't know
> how 'cause I haven't found an answer to my conundrum yet  ;-)
> 
> Cheers,
> 
> Micheal
> ANTLR/C#

Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/