Proxy Objects
Building Garbage Collected Circular References
by Matt SergeantAugust 07, 2002
In writing the XML::XPath
module, (a library that implements some of the XML Document Object
Model, or DOM) I came across a particular problem with Perl's memory-management mechanism. In this article, I will detail the problem
and demonstrate a technique for building data structures that do not
exhibit this problem.
The problem is circular references.
What Is a Circular Reference?
Circular references are simply self-referential data structures -- a complex data structure that at some point contains a reference to a part of itself further up in the hierarchy. They are a useful idiom when you need two parts of a structure to be able to refer to each other -- often we see them used in parent/child relationships, where a child object might need access to the methods in the parent:
The case we're concerned with here is an XML tree, where we have a root node, and each node can have one or more children. In an XML DOM, the child nodes need the ability to refer back to their parents:
my $parent_name = $node->getParentNode()->getNodeName();
While it is often extremely useful to encode this type of relationship in your code, it is rather problematic for Perl.
Reference Counting
Before we see why circular references are problematic, first we need to understand how Perl's memory management works. In order to return memory for use by other parts of your program when it becomes available, Perl uses a technique called "reference counted garbage collection".
By using a count on each variable within Perl of the number of
references to that variable (the number of other things in your program
that refer to it), Perl's garbage collector can ensure timely
destruction of lexical variables (those created with my).
Each time the Perl interpreter sees something referencing that
variable, it increments the variable's internal reference count. When
the reference count goes down to zero, that variable's memory is freed,
and its destructor is called.
You can see the reference count of a variable using the
Devel::Peek module:
use Devel::Peek;
my $x = "Hello";
my $y = \$x;
Dump($x);
Which outputs:
SV = PV(0x804b85c) at 0x8057620
REFCNT = 2
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x805d800 "Hello"\0
CUR = 5
LEN = 6
The important field for these purposes is the REFCNT field, which
shows us there are two references to our variable - one for the
main copy of the variable (the one we see as $x) and
one for the reference that $y holds.
Reference Counting with Circular References
Reference counting works well until you need to build self-referencing data structures. Let's look again at the design of our XML DOM library, where children need access to their parent nodes. The DOM specification requires you to maintain a two-way relationship between the parent and the child nodes. This leads to our circular reference -- the parent holds a reference to the child, and the child holds a reference to the parent.
Circular References in Detail
Circular references occur when a variable either directly or indirectly refers to itself somehow. We can easily show this using hash references and Devel::Peek again:
use Devel::Peek;
for (1..1) { # enter scope here
my %x = ();
my %y = ();
$x{child} = \%y;
$y{parent} = \%x;
Dump(\%x);
} # leave scope here
Now we can see that both variables have a refcount of 2 (the 3 for %x
is because we have to take an extra reference in order to
Dump() it):
...
SV = PVHV(0x8060b40) at 0x805702c
REFCNT = 3 ### This is %x
...
Elt "child" HASH = 0x77420b6
SV = RV(0x8063008) at 0x804b494
REFCNT = 1
FLAGS = (ROK)
RV = 0x8056fe4
SV = PVHV(0x8060ba0) at 0x8056fe4
REFCNT = 2 ### This is %y
...
Elt "parent" HASH = 0xcc03940
SV = RV(0x806300c) at 0x8056f84
REFCNT = 1
FLAGS = (ROK)
RV = 0x805702c
Now problems arise. When %y goes out of scope, it
cannot be completely garbage collected, because its reference count was
2, not 1. %x also cannot be garbage collected (if it could,
then it would free the extra reference to %y) for the same
reason. So we end up with two zombie variables, which can neither be
garbage collected nor freed back to the system. This is why we get what
appears to be memory leaks when we use circular references. Had we
modified our for () loop to repeat more than once, we would
see our memory usage steadily growing.
Rather than proving this by watching our memory grow though, we can
demonstrate what his happening with objects. We can create a
simple object and output something in its DESTROY method:
package CircObj;
sub new { bless {}, shift }
sub parent { my $self = shift; @_ ? $self->{parent} =
shift : $self->{parent} }
sub child { my $self = shift; @_ ? $self->{child} =
shift : $self->{child} }
sub DESTROY { warn("CircObj::DESTROY\n") }
Now a normal instance of this will output the destroy method as soon as its scope exits:
{
my $x = CircObj->new();
warn("Leaving scope\n");
}
warn("Scope left\n");
Results in the following output:
Leaving scope
CircObj::DESTROY
Scope left
However, if we fill in our circular references, then we get a different result:
for (1..1) {
my $parent = CircObj->new;
my $child = CircObj->new;
$parent->child($child);
$child->parent($parent);
warn("Leaving scope\n");
}
warn("Scope left\n");
And we see the output:
Leaving scope
Scope left
CircObj::DESTROY
CircObj::DESTROY
What this means is our variables are only getting
DESTROYed when the Perl interpreter does its global
cleanup -- at the time our program exits, not in the timely manner we
are used to with normal objects. This may not be too problematic for
some scripts, but if you're running any sort of large program, or a long-running persistent interpreter like mod_perl, then you will see
your memory steadily grow as you do this repeatedly.
A common way to "fix" this problem is to use a manual
destructor -- a method or function call that breaks the circle somehow.
In our original hashes example, this is as simple as adding delete
$x{child} to our code (deleting $y{parent} is
optional then, as the circle is already broken), and in the object
example we can supply a destroy() method that the user can
call before exiting the scope. But neither of those options is terribly
user friendly, and I believe in letting people who use my modules to
expect it to "just work".

