Java G1 garbage collection in production

I've been testing it out with a heavy application: 60-70GB allocated to heap, with 20-50GB in use at any time. With these sorts of applications, it's an understatement to say that your mileage may vary. I'm running JDK 1.6_22 on Linux. The minor versions are important-- before about 1.6_20, there were bugs in G1 that caused random NullPointerExceptions.

I've found that it is very good at keeping within the pause target you give it most of the time. The default appears to be a 100ms (0.1 second) pause, and I've been telling it to do half that (-XX:MaxGCPauseMillis=50). However, once it gets really low on memory, it panics and does a full stop-the-world garbage collection. With 65GB, that takes between 30 seconds and 2 minutes. (The number of CPUs probably doesn't make a difference; it's probably limited by the bus speed.)

Compared with CMS (which is not the default server GC, but it should be for web servers and other real-time applications), typical pauses are much more predictable and can be made much shorter. So far I'm having better luck with CMS for the huge pauses, but that may be random; I'm seeing them only a few times every 24 hours. I'm not sure which one will be more appropriate in my production environment at the moment, but probably G1. If Oracle keeps tuning it, I suspect G1 will ultimately be the clear winner.

If you're not having a problem with the existing garbage collectors, there's no reason to consider G1 right now. If you are running a low-latency application, such as a GUI application, G1 is probably the right choice, with MaxGCPauseMillis set really low. If you're running a batch-mode application, G1 doesn't buy you anything.


It sounds like the point of G1 is to have smaller pause times, even to the point where it has the ability to specify a maximum pause time target.

Garbage collection isn't just a simple "Hey, it's full, let's move everything at once and start over" deal any more--it's fantastically complex, multi-level, background threaded system. It can do much of its maintenance in the background with no pauses at all, and it also uses knowledge of the system's expected patterns at runtime to help--like assuming most objects die right after being created, etc.

I would say GC pause times are going to continue to improve, not worsen, with future releases.

EDIT:

in re-reading it occurred to me that I use Java daily--Eclipse, Azureus, and the apps I develop, and it's been a LONG TIME since I saw a pause. Not a significant pause, but I mean any pause at all.

I've seen pauses when I right-click on windows explorer or (occasionally) when I hook up certain USB hardware, but with Java---none at all.

Is GC still an issue with anyone?


Although I have not tested G1 in production, I thought I would comment that GCs are already problematic for cases without "humongous" heaps. Specifically services with just, say, 2 or 4 gigs can be severely impacted by GC. Young generation GCs are usually not problematic as they finish in single-digit milliseconds (or at most double-digit). But old-generation collections are much more problematic as they take multiple seconds with old-gen sizes of 1 gig or above.

Now: in theory CMS can help a lot there, as it can run most of its operation concurrently. However, over time there will be cases where it can not do this and has to fall back to "stop the world" collection. And when that happens (after, say, 1 hour -- not often, but still too often), well, hold on to your f***ing hats. It can take a minute or more. This is especially problematic for services that try to limit maximum latency; instead of it taking, say, 25 milliseconds to serve a request it now takes ten second or more. To add injury to insult clients will then often time out the request and retry, leading to further problems (aka "shit storm").

This is one area where G1 was hoped to help a lot. I worked for a big company that offers cloud services for storage and message dispatching; and we could not use CMS since although much of the time it worked better than parallel varieties, it had these meltdowns. So for about an hour things were nice; and then stuff hit the fan... and because service was based on clusters, when one node got in trouble, others typically followed (since GC-induced timeouts lead to other nodes believe node had crashed, leading to re-routes).

I don't think GC is that much of a problem for apps, and perhaps even non-clustered services are less often affected. But more and more systems are clustered (esp. thanks to NoSQL data stores) and heap sizes are growing. OldGen GCs are super-linearly related to heap size (meaning that doubling heap size more than doubles GC time, assuming size of live data set also doubles).


Azul's CTO, Gil Tene, has a nice overview of the problems associated with Garbage Collection and a review of various solutions in his Understanding Java Garbage Collection and What You Can Do about It presentation, and there's additional detail in this article: http://www.infoq.com/articles/azul_gc_in_detail.

Azul's C4 Garbage Collector in our Zing JVM is both parallel and concurrent, and uses the same GC mechanism for both the new and old generations, working concurrently and compacting in both cases. Most importantly, C4 has no stop-the-world fall back. All compaction is performed concurrently with the running application. We have customers running very large (hundreds of GBytes) with worse case GC pause times of <10 msec, and depending on the application often times less than 1-2 msec.

The problem with CMS and G1 is that at some point Java heap memory must be compacted, and both of those garbage collectors stop-the-world/STW (i.e. pause the application) to perform compaction. So while CMS and G1 can push out STW pauses, they don't eliminate them. Azul's C4, however, does completely eliminate STW pauses and that's why Zing has such low GC pauses even for gigantic heap sizes.

And to correct a statement made in an earlier answer, Zing does not require any changes to the Operating System. It runs just like any other JVM on unmodified Linux distros.


We are already using G1GC, from almost two years. Its doing great in our mission critical transaction processing system, and It proved to be a great support w.r.t high throughput, low pauses, concurrency and optimized heavy memory management.

We are using following JVM settings:

-server -Xms512m -Xmx3076m -XX:NewRatio=50 -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -XX:+AggressiveOpts -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=400 -XX:GCPauseIntervalMillis=8000 -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime

Updated

-d64 -server -Xss4m -Xms1024m -Xmx4096m -XX:NewRatio=50 -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:+HeapDumpOnOutOfMemoryError -XX:-DisableExplicitGC -XX:+AggressiveOpts -Xnoclassgc -XX:+UseNUMA -XX:+UseFastAccessorMethods -XX:ReservedCodeCacheSize=48m -XX:+UseStringCache -XX:+UseStringDeduplication -XX:MaxGCPauseMillis=400 -XX:GCPauseIntervalMillis=8000