[antlr-interest] Re: trees with payloads??

Wed Nov 10 23:35:11 PST 2004

--- In antlr-interest at yahoogroups.com, "lgcraymer" <lgc at m...> wrote:
> 
> --- In antlr-interest at yahoogroups.com, "micheal_jor" <open.zone at v...>
> wrote:
> ...
> > I have my doubts about doubly-linked ASTs too. I don't understand
> > however how you formed the opinion that arrays are memory hogs. I
> > would have though they were otherwise. And, if performance is
> > important, they [might] offer superior locality properties which makes
> > for very efficient cache usage. I say might only because this also
> > depends on the memory location of the objects in the array. 
> 
> Arrays are either part of AST memory, in which case they are of fixed
> length and there is almost always wasted space, or they are allocated
> as a separate object from the AST, in which case two pointers (down
> and right) become one pointer and an object with pointers to a vtable,
> a size field, and the array at a minimum.

Fixed length array would typically waste space (or raise errors if
they are too small). Agreed.

ArrayLists-type objects on the other hand don't have to waste space.
They become one pointer to the ArrayList which holds a size field and
an array with [approx] one pointer per child. The vtable is per class
not per instance.

For leaf nodes (where right == down == null) - one pointer is saved
since only one is needed for the ArrayList pointer (which would be
null too).

For internal non-leaf nodes - if the ArrayList is of the exact size
needed, only the ArrayList's size field is extra.

Aren't there always more leaf nodes?

>  This is offset by a savings
>  for terminal (right == down == null) nodes, but that does not help
> much with memory usage.  Arrays are memory hogs.

Fixed-length ones would usually be. Not convinced this holds for
ArrayLists.

> Locality is not a good argument.  The arrays would not be arrays of
> ASTs, but of references to ASTs.

Good point. I did qualify my locality statement.

> The ASTs themselves would be
> scattered to much the same extent in the two cases, and the variable
> length arrays would contribute to memory fragmentation.

The variable length array *might* contribute to memory fragmentation
if there were frequent re-allocations due to arrays outgrowing their
current size. It would be quite easy to ensure this wasn't this case
during tree construction though.

> > > Both ASTs and Tokens are then replaced by Carriers (probably with a
> > > better name).  This would have a performance advantage for tree
> > > construction in the Parser since the token Carriers can be linked to
> > > form a syntax tree in the parser.
> > 
> > Hmm, I am more inclined to eliminate [the cost of] Token objects
> > [almost] entirely. I say almost because a mechanism is still needed to
> > transfer attributes like lexeme/line/col info to AST node objects.
> 
> Not that easy.  You could have a fixed lookahead array, although that
> would not work if syn preds are present, containing token data that
> was then copied into ASTs as needed.  That leads to a lot of
> copying--not very helpful.
> 
> > The TokenStream interface in it's current form might not be the [best]
> > way to go. Unless Token objects can be recycled efficiently.
> 
> Carriers can be very efficiently recycled--you can recycle carrier
> nodes from a syntax tree to a Factory, which keeps them in a linked
> list and then feeds them back on allocation requests.

OK, following on from this:

I can grok Token objects with Carrier objects as payload. Carrier is
passed unchanged to AST objects.

But Token objects and AST objects replaced by Carrier objects?. There
isn't a one-to-one match between Token objects and AST objects in many
 cases. How does the unified Carrier concept deal with this?

> > So how can lexical tokens be modelled as an integer value (or enum
> > value) usually and yet parsers are able to extract additional info
> > such as lexeme/line/col on-demand and, inexpensively?
> 
> Token management in lexers is an art.  You have to avoid copying data
> as much as possible for performance reasons and minimize memory
> allocation.  llk does a reasonably good job; I think that Chris Leung
> took advantage of lessons learned from flex and other lexer
> generators.  Performance paranoia helps.  Payload objects could help
> here, too--you can keep text start and end pointers into a
> memory-mapped text file instead of allocating and copying strings.

Yep, right you are. Performance tuning is hard. You would typically
need to copy selected substrings from the input buffer to another
buffer where the start/end indices refer. Else, you might end up
keeping a 100kB memory-mapped file hanging around just for the sake of
a few substrings.

Cheers,

Micheal
ANTLR/C#

Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/