[antlr-interest] Re: trees with payloads??

Thu Nov 11 00:13:16 PST 2004

--- In antlr-interest at yahoogroups.com, "micheal_jor" <open.zone at v...>
wrote:
> 
> --- In antlr-interest at yahoogroups.com, "lgcraymer" <lgc at m...> wrote:
> > 
> > --- In antlr-interest at yahoogroups.com, "micheal_jor" <open.zone at v...>
> > wrote:
> > ...
> > > I have my doubts about doubly-linked ASTs too. I don't understand
> > > however how you formed the opinion that arrays are memory hogs. I
> > > would have though they were otherwise. And, if performance is
> > > important, they [might] offer superior locality properties which
makes
> > > for very efficient cache usage. I say might only because this also
> > > depends on the memory location of the objects in the array. 
> > 
> > Arrays are either part of AST memory, in which case they are of fixed
> > length and there is almost always wasted space, or they are allocated
> > as a separate object from the AST, in which case two pointers (down
> > and right) become one pointer and an object with pointers to a vtable,
> > a size field, and the array at a minimum.
> 
> Fixed length array would typically waste space (or raise errors if
> they are too small). Agreed.
> 
> ArrayLists-type objects on the other hand don't have to waste space.
> They become one pointer to the ArrayList which holds a size field and
> an array with [approx] one pointer per child. The vtable is per class
> not per instance.

Each object has a pointer to the vtable.

> For leaf nodes (where right == down == null) - one pointer is saved
> since only one is needed for the ArrayList pointer (which would be
> null too).
> 
> For internal non-leaf nodes - if the ArrayList is of the exact size
> needed, only the ArrayList's size field is extra.

Not true--there are hidden fields:  pointer to vtable (vtable
reference), size of allocated memory (larger than the object) or
pointer to following memory, and size of object.  There may also be a
reference counter or other fields for garbage collection (depends on
the GC algorithm).

> Aren't there always more leaf nodes?

Yes, but not all leaf nodes have only null pointers (I'll call them
edge nodes).  The edge nodes are a fraction of the total.

> >  This is offset by a savings
> >  for terminal (right == down == null) nodes, but that does not help
> > much with memory usage.  Arrays are memory hogs.
> 
> Fixed-length ones would usually be. Not convinced this holds for
> ArrayLists.

I should also point out that memory is allocated in "chunks" to avoid
fragmentation.  There is a minimum amount of memory for each allocated
object to help avoid memory fragmentation when objects are deleted and
the memory reused for another create.  64 or 128 bytes is not unusual
for C/C++; Java and C# would be somewhat similar.

...
> > Carriers can be very efficiently recycled--you can recycle carrier
> > nodes from a syntax tree to a Factory, which keeps them in a linked
> > list and then feeds them back on allocation requests.
> 
> OK, following on from this:
> 
> I can grok Token objects with Carrier objects as payload. Carrier is
> passed unchanged to AST objects.

Conceptual block!  AST = Token = Carrier with Payload.

> But Token objects and AST objects replaced by Carrier objects?. There
> isn't a one-to-one match between Token objects and AST objects in many
>  cases. How does the unified Carrier concept deal with this?

Not all Tokens become ASTs, and some ASTs are user-created.  However,
all non-user-created ASTs have corresponding Tokens.  You gain
performance efficiency by using Carriers with Payloads for Tokens and
then reusing them as ASTs.

> > > So how can lexical tokens be modelled as an integer value (or enum
> > > value) usually and yet parsers are able to extract additional info
> > > such as lexeme/line/col on-demand and, inexpensively?
> > 
> > Token management in lexers is an art.  You have to avoid copying data
> > as much as possible for performance reasons and minimize memory
> > allocation.  llk does a reasonably good job; I think that Chris Leung
> > took advantage of lessons learned from flex and other lexer
> > generators.  Performance paranoia helps.  Payload objects could help
> > here, too--you can keep text start and end pointers into a
> > memory-mapped text file instead of allocating and copying strings.
> 
> Yep, right you are. Performance tuning is hard. You would typically
> need to copy selected substrings from the input buffer to another
> buffer where the start/end indices refer. Else, you might end up
> keeping a 100kB memory-mapped file hanging around just for the sake of
> a few substrings.

True, but a 100KB or even 100 MB memory-mapped file is not a big
expense--the OS virtual memory management takes care of that.  If you
only keep a few strings, then you only keep a few blocks of the file
memory-resident.  On the other hand, 20,000 string copies and creates
(5 chars per word, which is probably an overestimate) is a performance
hit.

--Loring

> Cheers,
> 
> Micheal
> ANTLR/C#

Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/