Explanation of Azul's "pauseless" garbage collector

I've just read this:

http://www.artima.com/lejava/articles/azul_pauseless_gc.html

Although I have some experience with compilers, I've done nothing related to garbage collection; it's a big black box to me.

I've struggled to understand the issues mentioned in the article. I understand the problem (there's a pause when executing most garbage collectors), and I understand that they claim their implementation doesn't have that problem. But I don't understand why/how the problem happens in the first place (the article seems to assume that much is already understood), and consequently I don't get why their solution might work.

Can someone explain to me:

  1. why garbage collectors have that pause in general?
  2. why Azul's GC doesn't have that problem?

I tend to understand these kinds of things better when they're explained graphically - probably a small memory diagram done in the code editor would suffice.

Thanks!


They talk about the pause that inevitably occurs when compacting the heap. You see, when you allocate and deallocate lots of objects of different sizes as you go, you fragment the heap (much like you fragment your hard drive). When fragmentation becomes too extreme, you have to clean up/defragment/compact the heap by reserving a huge chunk of memory, moving all objects there (without any fragmentation) and using their former locations as a fresh chunk of memory without any objects in it, i.e. without fragmentation.

When you do that, you invalidate all references to the objects you moved. To prevent problems, you must ensure that no reference pointing to a pre-compaction object location is ever used. By far the easiest way to do that is to pause the whole application, move the objects around, and then go and update all references. Of course this can incur significant overhead.
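To make the stop-the-world approach concrete, here is a toy, hypothetical simulation (nothing like real JVM internals): the "heap" is an array of slots, the application holds integer "references" into it, and compaction slides live objects to the front and then fixes every reference before the application is allowed to resume:

```java
import java.util.*;

public class CompactDemo {
    // Stop-the-world compaction: slide live objects (non-null slots) to the
    // front of the heap, then rewrite every application reference in place.
    static void compact(String[] heap, int[] refs) {
        Map<Integer, Integer> forwarding = new HashMap<>(); // old slot -> new slot
        int next = 0;
        for (int i = 0; i < heap.length; i++) {
            if (heap[i] != null) {
                heap[next] = heap[i];
                if (next != i) heap[i] = null;
                forwarding.put(i, next++);
            }
        }
        // Fix ALL references before the "application" runs again.
        for (int i = 0; i < refs.length; i++) refs[i] = forwarding.get(refs[i]);
    }

    public static void main(String[] args) {
        String[] heap = { "A", null, "B", null, null, "C" }; // fragmented heap
        int[] refs = { 0, 2, 5 };                            // application references
        compact(heap, refs);
        System.out.println(Arrays.toString(heap)); // [A, B, C, null, null, null]
        System.out.println(Arrays.toString(refs)); // [0, 1, 2]
    }
}
```

The key point is the second loop: nothing may run between moving the objects and rewriting the references, which is exactly the pause.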

So the solution Azul proposes goes like this: they establish a "read barrier" that allows the GC to intercept dereferencing, and this way they can lazily update only the references that are actually used.
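A minimal sketch of that idea (a hypothetical simulation; real read barriers are emitted by the JIT on every reference load, not written by hand): a relocated object leaves a forwarding pointer behind, and the barrier follows it and repairs the stale reference the first time it is used, so no global fix-up pause is needed:

```java
public class ReadBarrierDemo {
    static class Obj {
        String payload;
        Obj forwardedTo; // set by the GC when it relocates this object
        Obj(String p) { payload = p; }
    }

    static class Ref {
        Obj target;
        Ref(Obj t) { target = t; }
        // Every dereference goes through the barrier: if the target has
        // moved, follow the forwarding pointer and update this reference
        // lazily instead of fixing all references up front.
        Obj get() {
            if (target.forwardedTo != null) target = target.forwardedTo;
            return target;
        }
    }

    public static void main(String[] args) {
        Obj old = new Obj("payload");
        Ref ref = new Ref(old);
        // GC relocates the object while the application keeps running:
        Obj moved = new Obj(old.payload);
        old.forwardedTo = moved;
        // The next use heals the stale reference without a global pause.
        System.out.println(ref.get() == moved); // true
    }
}
```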


why garbage collectors have that pause in general?

GCs work by tracing reachable heap blocks starting from a set of global roots (global variables, thread stacks and CPU registers). GCs lie on a sliding scale from snapshot to on-the-fly. Snapshot GCs work from a snapshot of the global roots and heap topology. On-the-fly GCs incrementally update their interpretation of the heap as the mutators run.

Tracing is not "why garbage collectors have that pause in general", there are bigger reasons for pausing: object relocation being the dominant one.

But to the point of on-the-fly vs. snapshot tracers and their relative efficiency: on-the-fly tracing can be more efficient than snapshot-at-the-beginning tracing. The same paper you refer to in describing [VCGC] classifies Azul's previous, non-generational Pauseless collector as a precise wavefront [3] tracer:

"... Most practical collectors use conservative abstractions of the wavefront rather than the precise definition provided here. That is, the wavefront is tracked at an object granularity. However, the precise wavefront is not merely theoretical and has recently been used in the hardware-assisted collector for the Azul Java server, which has a “not marked-through” bit in every pointer [2]."

Azul's C4 shares this quality, but achieves it using a pure-software, self-healing LVB read barrier.

Wholly snapshot GCs attain high throughput because the collector runs almost entirely independently of the mutators but have high latency because taking a snapshot incurs a pause. Wholly on-the-fly GCs attain low latency because everything is done incrementally but low throughput because of the fine-grained communication between the mutators and GC.

A precise wavefront tracer is as efficient (from the "not spending time on unneeded objects" point of view discussed in the paper) as a stop-the-world tracer, which by definition also has a precise wavefront. Compared to snapshot-based approaches, precise wavefront scanning doesn't slow throughput in any way, or require any more communication between the collector and the mutator. It has as-good-or-better throughput, as its precision ensures that it never has to repeat any tracing work.

why Azul's gc doesn't have that problem?

They do still have this problem but they lessened it by implementing a read barrier in hardware. Read barriers had been proposed before but software read barriers degrade throughput too much because pointer reads are much more common than writes.

As noted, if "the problem" were inefficiency due to on-the-fly vs. snapshot behavior, then C4 doesn't have it, because it is a precise wavefront tracer. Furthermore, Azul's C4 collector doesn't need or use a hardware read barrier, as it runs on vanilla x86 and Linux systems, and achieves better throughput on that hardware than snapshot-based tracing collectors do (see the throughput comparisons in [1]).

However, "the problem" in the question was stated as "why garbage collectors have that pause in general?" Wavefront preciseness (or not) has little to do with the dominant pauses in garbage collectors. Concurrent and mostly-concurrent markers (even if less efficient than C4's) do exist, but their collectors still pause. The issue is that tracing is only a part of collection. Tracing only tells you what is alive and where it is. It doesn't give you any memory back, and it certainly doesn't de-fragment fragmented memory. That subject is discussed in depth in various academic papers (see the slew of references in the C4 paper [1]).

It is compaction (and the implied object relocation) that seems to be the Achilles' heel of currently shipping collectors on server JVMs, and the thing that inherently causes them to take [large] pauses. The simple act of relocating even a single object from one place to another means that you have to FIX all references pointing to that object before the program uses them. For most commercially shipping collectors, this means a stop-the-world pause that keeps the application from running while references are being fixed.

C4 leverages the self-healing LVB barrier (a new type of read barrier introduced in [2] and used heavily in [1] in software-only form) to avoid having to fix references before the application is allowed to run. That's how it avoids the pause that other collectors end up having to take. The self-healing quality reduces the dynamic cost of the read barriers by several orders of magnitude compared to previous, non-self-healing barriers (such as the Brooks-style barrier used in other concurrent compactors in academic works, and in some real-time collectors). The result of this dramatically lower read-barrier cost is that it is practical to use in generational collection and on server-scale JVMs.
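To illustrate the self-healing property with a toy sketch (an assumed simplification of the LVB described in [2], not the real mechanism): a Brooks-style barrier pays the forwarding cost on every load of a stale field, while a self-healing barrier also writes the corrected reference back into the field it was loaded from, so each stale field takes the slow path at most once:

```java
public class SelfHealingDemo {
    static class Obj {
        Obj forwardedTo; // set by the GC when it relocates this object
        Obj field;       // a reference field somewhere in the heap
    }

    static int barrierHits = 0; // counts slow-path executions

    // Load holder.field through the barrier, healing the field if stale.
    static Obj loadBarrier(Obj holder) {
        Obj v = holder.field;
        if (v != null && v.forwardedTo != null) {
            barrierHits++;
            v = v.forwardedTo;
            holder.field = v; // heal: later loads of this field take the fast path
        }
        return v;
    }

    public static void main(String[] args) {
        Obj target = new Obj();
        Obj holder = new Obj();
        holder.field = target;
        target.forwardedTo = new Obj(); // GC moved the target
        loadBarrier(holder);            // slow path: follows and heals
        loadBarrier(holder);            // fast path: already healed
        System.out.println(barrierHits); // 1
    }
}
```

Without the `holder.field = v` store, every subsequent load of that field would hit the slow path again, which is roughly why non-self-healing software barriers cost so much more.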

[1]: "C4: the continuously concurrent compacting collector" http://dl.acm.org/citation.cfm?id=1993491&dl=ACM&coll=DL&CFID=85063603&CFTOKEN=84074207
[2]: "The Pauseless GC Algorithm" http://static.usenix.org/events/vee05/full_papers/p46-click.pdf
[3]: "Correctness-Preserving Derivation of Concurrent Garbage Collection Algorithms" www.srl.inf.ethz.ch/papers/pldi06-cgc.pdf

(Graham Thomas, EMEA Technical Manager, Azul Systems)


why garbage collectors have that pause in general?

GCs work by tracing reachable heap blocks starting from a set of global roots (global variables, thread stacks and CPU registers). GCs lie on a sliding scale from snapshot to on-the-fly. Snapshot GCs work from a snapshot of the global roots and heap topology. On-the-fly GCs incrementally update their interpretation of the heap as the mutators run.
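The tracing step just described can be sketched as a simple worklist-driven mark phase (a toy simulation, not real collector internals; the `Node` graph stands in for heap objects and their reference fields):

```java
import java.util.*;

public class TraceDemo {
    static class Node {
        boolean marked;
        List<Node> fields = new ArrayList<>(); // outgoing references
    }

    // Mark everything reachable from the roots (globals, stacks, registers);
    // anything left unmarked afterwards is garbage.
    static void mark(List<Node> roots) {
        Deque<Node> worklist = new ArrayDeque<>(roots);
        while (!worklist.isEmpty()) {
            Node n = worklist.pop();
            if (n.marked) continue;
            n.marked = true;
            worklist.addAll(n.fields);
        }
    }

    public static void main(String[] args) {
        Node root = new Node(), child = new Node(), garbage = new Node();
        root.fields.add(child); // garbage is referenced by nothing
        mark(List.of(root));
        System.out.println(root.marked + " " + child.marked + " " + garbage.marked);
        // true true false
    }
}
```

The snapshot vs. on-the-fly distinction is about what happens when the mutators change `fields` while this loop is still running.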

Wholly snapshot GCs attain high throughput because the collector runs almost entirely independently of the mutators but have high latency because taking a snapshot incurs a pause. Wholly on-the-fly GCs attain low latency because everything is done incrementally but low throughput because of the fine-grained communication between the mutators and GC.

In practice, all GCs lie somewhere between these two extremes. VCGC is primarily a snapshot GC but it uses a write barrier to keep the collector apprised of changes to the heap topology. Staccato was the world's first parallel, concurrent, and real-time GC, but it still batches up some operations in order to retain the efficiency of stack allocation.
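The write-barrier idea can be sketched like this (a hypothetical simplification; real barriers are compiler-emitted per reference store and typically use card tables or log buffers rather than a plain list): every reference store is recorded so a snapshot-based collector can learn about edges created after its snapshot was taken:

```java
import java.util.*;

public class WriteBarrierDemo {
    static class Obj { Obj field; }

    // Mutations the collector has not yet processed.
    static final List<Obj> mutationLog = new ArrayList<>();

    // All reference stores go through the barrier.
    static void writeBarrier(Obj holder, Obj value) {
        mutationLog.add(value); // tell the collector about the new edge
        holder.field = value;
    }

    public static void main(String[] args) {
        Obj a = new Obj(), b = new Obj();
        writeBarrier(a, b); // the collector will later re-scan b
        System.out.println(mutationLog.size()); // 1
    }
}
```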

why Azul's gc doesn't have that problem?

They do still have this problem but they lessened it by implementing a read barrier in hardware. Read barriers had been proposed before but software read barriers degrade throughput too much because pointer reads are much more common than writes.