[antlr-interest] C runtime Memory Usage

Sat Jan 24 16:34:27 PST 2009

At 13:01 25/01/2009, Jim Idle wrote:
 >> Strings in Java and C# are immutable; in C/C++ they're not, 
but
 >> they should be treated as if they were
 >err, not really. That's a completely arbitrary decisions that 
you
 >just made up.

No, I didn't.  In fact, most implementations of the C++ STL do 
something similar with std::string (copy-on-write), so it's not 
even unique to managed code.

 >But, my code has to take care of the fact that people will and 
DO
 >do this, and then they will wonder why the next time they ask 
for
 >the string for that token, they got the last change that they 
made.
 >In fact I would hazard a guess that if in fact you reference the 

 >$text, you are much more likely to want to change it than you
 >would be in an ordinary C program.

If they use $text = foo, then sure, they're trying to replace the 
text for the token.  But replace is not the same as 
modify.  Modifying the result of $text directly should be 
forbidden, which is easy if you just make it const.  And most of 
the time this sort of thing is confined to the lexer/parser, 
anyway.  It'd be less common at the tree parser level.

Tokens already have to support the idea of drawing their text 
either from the input stream (if they haven't been replaced as 
above) or from arbitrary text set by embedded code in the 
grammar.  So all you would need to do is to set the text the first 
time it is queried for.  You get performance benefits both ways, 
that way -- if they never ask, it never needs to query the token 
stream and allocate the memory, and if they ask multiple times 
then it only needs to do so once and doesn't waste additional 
memory.

Or perhaps another approach would be to more closely model how 
std::string works.  When retrieving the text as an ANTLR string 
and manipulating it with the ANTLR string manipulation functions, 
it's writable but performs a copy-on-write as needed to preserve 
referential integrity of other strings.  When retrieving a raw 
char* it only gives you a const one, to let you know that you 
should be using the ANTLR functions to modify it instead.

This isn't hard to do (especially single-threaded) and shouldn't 
impose much of a CPU performance penalty, and should improve 
memory performance dramatically whenever the text is being 
accessed.  (And in any non-trivial program, the text is bound to 
be accessed quite a bit.)

 >It IS trivial to take text from the token stream - it is just
 >pointers.

Which you have to iterate and assemble (for rule text, 
anyway).  So no, it's not trivial.  (It might be *easy*, but not 
trivial.)