Derivation and Evaluation of Concurrent Collectors
Martin T. Vechev, David F. Bacon, Perry Cheng, and David Grove
Computer Laboratory, Cambridge University
Cambridge CB3 0FD, U.K.
IBM T.J. Watson Research Center
P.O. Box 704, Yorktown Heights, NY 10598, U.S.A.
Abstract. There are many algorithms for concurrent garbage collection, but they
are complex to describe, verify, and implement. This has resulted in a poor understanding of the relationships between the algorithms, and has precluded systematic study and comparative evaluation. We present a single high-level, abstract concurrent garbage collection algorithm, and show how existing snapshot and incremental update collectors can be derived from the
algorithm by reducing precision. We also derive a new hybrid algorithm that reduces ﬂoating garbage while terminating quickly. We have implemented a concurrent collector framework and the resulting algorithms in IBM’s J9 Java virtual machine product and compared their performance in terms of space, time, and incrementality.
The results show that incremental update algorithms sometimes reduce memory requirements (on 3 of 5 benchmarks) but they also sometimes take longer due to recomputation in the termination phase (on 4 of 5 benchmarks). Our new hybrid algorithm has memory requirements similar to the incremental update collectors while avoiding recomputation in the termination phase.
1 Introduction

The wide acceptance of the Java programming language has brought garbage-collected languages into the mainstream. However, the use of traditional synchronous (“stop the world”) garbage collection is limiting the domains into which Java and similar languages can expand. The need for concurrent garbage collection is primarily being driven by two trends: the first is increased heap sizes, which make the pauses longer and less tolerable; the second is the increase in the use of, and complexity of, real-time systems, for which even short pauses are often unacceptable. Therefore there is a need for rapid improvement in various kinds of incremental and concurrent collector technology.
Unfortunately, concurrent garbage collectors are one of the more difficult concurrent programs to construct correctly. The study of concurrent collectors began with Steele, Dijkstra, and Lamport.
Concurrent collectors were considered paradigmatic examples of the difficulty of constructing correct concurrent algorithms. Steele’s algorithm contained an error which he subsequently corrected, and Dijkstra’s algorithm contained an error discovered and corrected by Stenning and Woodger. Furthermore, some correct algorithms had informal proofs that were found to contain errors.
These problems also manifest themselves in practice because concurrent bugs generally have a non-deterministic effect on the system and are non-repeatable, so that connecting the cause of the error to the observed effect is particularly difﬁcult.
Many incremental and concurrent algorithms have been introduced in the last 30 years [1, 3, 4, 6, 7, 10, 11, 12, 13, 16, 17, 18, 19, 21, 23, 24], but there has been very little comparative evaluation of the properties of the different algorithms due to the complexity of implementing even one algorithm correctly. As has been noted, because of these constraints, current state-of-the-art concurrent systems are generally not quantitatively compared against each other, and the exact relationships among the different concurrent schemes are largely unknown.
For example, early collectors were all examples of incremental update collectors, which “chase down” modifications to the object graph that are made by the program during collection. Yuasa introduced snapshot collectors, which do not attempt to collect garbage allocated after collection begins, but do not require any rescanning of the object graph. Thus, snapshot collectors accept a potential increase in floating garbage in exchange for reliable termination. However, the costs and benefits relative to incremental update techniques have not been systematically studied.
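To make the distinction concrete, the two styles can be sketched as write barriers on a toy heap. This is an illustrative sketch only: the dict-based object encoding, the shade_grey helper, and the function names are ours, not the paper’s.

```python
# The two classic concurrent-collection barrier styles, on a toy heap
# where each object is a dict with a "color" and a "fields" map.

mark_stack = []

def shade_grey(obj):
    """Mark an object and queue it for scanning if it is still white."""
    if obj is not None and obj["color"] == "white":
        obj["color"] = "grey"
        mark_stack.append(obj)

def yuasa_write(obj, field, new_ref):
    """Snapshot-at-the-beginning (after Yuasa): shade the pointer being
    OVERWRITTEN, so everything reachable when the cycle started stays live."""
    shade_grey(obj["fields"][field])   # record the old value
    obj["fields"][field] = new_ref

def dijkstra_write(obj, field, new_ref):
    """Incremental update (after Dijkstra): shade the pointer being
    INSTALLED, so a reference hidden behind the wavefront is not lost."""
    shade_grey(new_ref)                # record the new value
    obj["fields"][field] = new_ref
```

The snapshot barrier conservatively retains the overwritten target (possible floating garbage but no rescanning); the incremental-update barrier chases the newly installed pointer, which is why termination may require rescanning.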
This paper presents a high-level algorithm for concurrent collection that subsumes and generalizes several previous concurrent collector techniques. This algorithm is signiﬁcantly more precise than previous algorithms (at the expense of constant-factor increases in both time and space), and more importantly yields a number of insights into the operation of concurrent collection. For instance, the operation of concurrent write barriers can be viewed as a form of degenerate reference counting; in our algorithm, we do true reference counting and are thereby able to ﬁnd live data more precisely.
Existing algorithms can then be viewed as instantiations of the generalized algorithm that sacriﬁce precision for compactness of object representation and speed of the collector operations (especially the write barriers).
Additionally, we argue that all of the existing concurrent algorithms fundamentally share a deeper structure, and that starting from this structure could uncover a whole continuum of algorithms that has not yet been explored.
Moreover, by having a common abstract algorithm, much of the construction of the practical collector will be simpliﬁed.
The contributions of this paper are:
– A generalized, extendable, abstract concurrent collection algorithm, which is more precise than previous algorithms;
– A demonstration of how the abstract algorithm can be instantiated to yield existing snapshot and incremental update algorithms;
– A new snapshot algorithm (derived from the abstract algorithm) that allocates objects unmarked (“white”) and reduces floating garbage without the re-scanning of the heap required by incremental update algorithms;
– An implementation of four concurrent collectors in a production-quality virtual machine (IBM’s J9 JVM product): snapshot (after Yuasa), two incremental update (after Dijkstra and Steele), and our hybrid snapshot algorithm; and
– A quantitative experimental evaluation comparing the performance of the different algorithms.
2 An Abstract Collector

This section presents the abstract collector algorithm. The algorithm is designed for maximum precision and flexibility, and keeps much more information per object than would be practical in a realistic implementation. However, the space overhead is only a constant factor and thus does not affect the asymptotic complexity of the algorithm, while the additional information allows a potential reduction in complexity.
Similarly, a number of operations employed by the abstract algorithm also have constant time overheads that would be undesirable in a realistic collector. In particular, there is no special treatment of stack variables: they are assumed to be part of the heap and therefore every stack operation may incur a constant-time overhead for the collector to execute an associated barrier operation. There are a number of collectors for functional languages (such as ML and Haskell) that treat the stack in exactly this way.
Our generalized concurrent collection algorithm makes use of the framework of Bacon et al.: they showed that for synchronous (“stop the world”) garbage collection, tracing and reference counting can be considered as dual approaches to computing the reference count of an object. Tracing computes a least fixpoint, and reference counting computes a greatest fixpoint. The difference between the greatest and least fixpoints is the cyclic garbage. In most practical tracing collectors, the reference count is collapsed into a single bit.
Furthermore, they showed that all collectors could be considered as a combination of tracing and reference counting, and that any incrementality is due to the use of a reference counting approach with its write barriers.
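The duality can be illustrated on a toy heap (our own encoding, not the paper’s): tracing grows a live set upward from the roots (least fixpoint), while reference counting shrinks downward from the full heap, retaining anything with a nonzero count, including unreachable cycles (greatest fixpoint).

```python
# Toy heap: each object maps to the list of objects it points to.
heap = {
    "root": ["a"],
    "a": [],
    "b": ["c"],   # b and c form an unreachable cycle
    "c": ["b"],
}

def trace(roots):
    """Least fixpoint: start from nothing, grow to everything reachable."""
    live, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj not in live:
            live.add(obj)
            stack.extend(heap[obj])
    return live

def refcount_live(roots):
    """Greatest fixpoint: start from the whole heap, repeatedly discard
    objects with zero references from roots or surviving objects."""
    live = set(heap)
    changed = True
    while changed:
        changed = False
        for obj in list(live):
            refs = sum(obj in heap[o] for o in live) + (obj in roots)
            if refs == 0:
                live.discard(obj)
                changed = True
    return live
```

Here trace({"root"}) yields {root, a}, while refcount_live({"root"}) retains the cycle {b, c} as well; the difference is exactly the cyclic garbage.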
This insight is now extended to concurrent tracing collectors: we show that they are also a tracing/reference counting hybrid. The collector traces the original object graph as it existed at the time when collection started, but does reference counting for pointers to live objects that could be lost due to concurrent mutation.
The abstract algorithm makes use of the variables depicted in Table 1. In the discussion that follows, we elaborate more on the semantics of each shared variable.
2.1 Restrictions and Assumptions

The algorithms we discuss are non-moving and concurrent, but not parallel. That is, the collector is single-threaded. The ideas derived from this discussion, however, are easily extended to algorithms using multiple spaces, such as generational ones.
Furthermore, the algorithm performs synchronization with atomic sections rather than isolated atomic (compare-and-swap) operations. Atomic sections are relatively expensive on a multiprocessor, so that although the algorithm can be executed on a multiprocessor it is better suited to a uniprocessor system based on safe points, in which low-level atomicity is a by-product of the implementation style of the run-time system.
Additionally, we assume that the concurrency between the mutators and the collector is bounded by a single cycle. This is a common underlying assumption in most practical algorithms. Essentially, this means that all mutator operations started in collector cycle N finish in that cycle; they do not carry over to cycle N + 1, for example. No pipelining between the collector phases is assumed: the sweeping of one cycle completes before the marking of the next begins.
Shared Variable   Description                        Computed By   Value Domain

Global Variables
  Phase           Current collector phase            Collector     [Idle, Tracing, Sweeping]
  Hue             Scanned part of the heap           Collector     [0, |H|]

Per-Object Variables
  Marked          Mark flag                          Collector     Boolean
  SRC             Scanned reference count            Mutator       [0, |P|]
  Shade           Scanning progress within object    Collector     [0, |N|]
  Recorded        Recorded in buffer by barrier      Mutator       Boolean
  DontSweep       Allocated after Hue                Mutator       Boolean

Table 1. Shared variables. |N| is the object size, |P| is the maximum number of pointers in the heap, and |H| is the number of objects in the heap.
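The shared variables of Table 1 can be transcribed directly into code. The following sketch mirrors the table’s fields and producers; the type and name choices are ours.

```python
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):
    """Current collector phase (computed by the collector)."""
    IDLE = 0
    TRACING = 1
    SWEEPING = 2

# The global Hue (scanned part of the heap, in [0, |H|]) would be a
# plain counter maintained by the collector.

@dataclass
class ObjectMeta:
    """Per-object shared variables from Table 1."""
    marked: bool = False      # mark flag                     (collector)
    src: int = 0              # scanned reference count       (mutator), in [0, |P|]
    shade: int = 0            # scanning progress in object   (collector), in [0, |N|]
    recorded: bool = False    # recorded in buffer by barrier (mutator)
    dont_sweep: bool = False  # allocated after Hue           (mutator)
```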
For the sake of presentation, we also make a number of simplifying assumptions about the heap. We assume that all heap objects are the same size S and consist only of collector meta-data and object data fields, which are all pointers. The fields of an object X are denoted X[1] through X[S].
2.2 Tracing

The abstract algorithm is shown in Figs. 1 and 2. We begin by describing the outer collection loop and the tracing phase of the collection cycle.
The Collect() procedure is invoked to perform a (concurrent) garbage collection. When it starts, the Phase of the collector is Idle, and the ﬁrst thing it does is to atomically mark the root object and set the collector phase to Tracing. Atomicity is required because mutators can perform operations dependent on the collection phase.
Because all variables live in the heap, there is only a single root that must be marked atomically. In a realistic collector that avoided write barriers on stack writes, this single operation would be replaced by atomic marking of all of the roots – which could be on stacks or on global variables.
The core of the algorithm is the invocation of Trace(), which is repeated until the concurrently executing mutators have made no modifications to the object graph that could result in unmarked live objects.
Tracing in our algorithm is very similar to the tracing in a synchronous collector: it repeatedly gets an object from the mark stack and scans it.
Shades of Grey. In the Scan() procedure the first major difference appears. Like a standard tracing collector, we iterate over the fields of the object and mark them.
However, as each ﬁeld is read, the Shade of the object is incremented.
The use of shades is one of the generalizations of our algorithm. Most concurrent collectors use the well-known tri-color abstraction: an object is white if it has not been seen by the collector, grey if it has been seen but not all of its fields have been seen, and black if both it and its fields have been seen.
[Fig. 1. Abstract Collector Code — listing lost in extraction; the surviving fragment shows Collect() beginning with an atomic section that performs Mark(root) and sets Phase = Tracing.]

[Fig. 2. Abstract Mutator Code — listing lost in extraction; surviving fragment: Obj.Recorded = true.]

The color of an object represents the progress of the tracing wavefront as it sweeps over the graph. However, the tri-color abstraction loses information because it does not track the progress of marking within the object. Fundamentally, the synchronization between the collector and the mutator depends on whether an object being mutated has been seen yet by the collector. Therefore, by losing information about the marking progress, the precision of the algorithm is compromised.
The Shade of an object is simply a generalization of the tri-color abstraction: objects are still white, grey, or black, but there are many shades of grey. The shade represents the exact progress of marking within the object. When Shade is 0, the object is white.
When it is the same as the number of ﬁelds in the object, the object is black. We will describe how the shade information is used when we present the write barrier executed by the mutator.
Once the Scan() procedure has updated the shade, it marks the target object. The Mark() procedure pushes the object onto the mark stack if it was not already marked.
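A minimal sketch of Scan(), Mark(), and the shade-to-color mapping, under the paper’s simplifying assumption that every object has exactly S pointer fields; the dict-based object encoding is ours.

```python
S = 4  # assumed uniform object size, per the simplified heap model

mark_stack = []

def color(obj):
    """Recover the tri-color abstraction from the exact Shade."""
    if obj["shade"] == 0:
        return "white"          # not yet seen
    if obj["shade"] == S:
        return "black"          # every field scanned
    return "grey"               # one of the many intermediate shades

def mark(obj):
    """Push the object onto the mark stack if it was not already marked."""
    if not obj["marked"]:
        obj["marked"] = True
        mark_stack.append(obj)

def scan(obj):
    """Scan one object: read each field, bump Shade, then mark the target."""
    for i in range(S):
        target = obj["fields"][i]
        obj["shade"] += 1       # record exact progress within the object
        if target is not None:
            mark(target)
```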
2.3 Mutator Interaction

We now turn to the interaction between the mutator and collector by considering the actions of the mutator when it changes the object graph. The connectivity graph can be modified by both pointer modification and object allocation.
Write Barrier. The write barrier is depicted by the procedure WriteBarrier() in Fig. 2. In our presentation of the algorithm, the entire write barrier is atomic. Finer-grained concurrency is possible, but is not discussed in this paper.
The write barrier takes a pointer to the object being modiﬁed, the ﬁeld in the object that is being modiﬁed, the new pointer that is being stored into the object, and a ﬂag indicating whether the new pointer refers to an object that was just allocated.
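Only the barrier’s parameter list is given by the text; the body below is a hedged placeholder showing where the atomic section and the Table 1 bookkeeping (SRC, Recorded, DontSweep) would sit. A lock stands in for the paper’s atomic section, and the field encoding is ours.

```python
import threading

_barrier_lock = threading.Lock()  # stands in for the abstract atomic section

def write_barrier(obj, field, new_ptr, is_new):
    """obj: the object being modified; field: the index being stored into;
    new_ptr: the pointer being installed; is_new: whether new_ptr refers
    to a just-allocated object."""
    with _barrier_lock:
        old_ptr = obj["fields"][field]
        # Collector-specific bookkeeping would go here: e.g. adjusting the
        # scanned reference count (SRC) of old_ptr or new_ptr depending on
        # how far the tracing wavefront has progressed through obj (its
        # Shade), and setting Recorded/DontSweep as appropriate.
        obj["fields"][field] = new_ptr
    return old_ptr
```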