Setting -XX:+DisableExplicitGC in production: what could go wrong?

We just had a meeting to address some performance issues in a web application that is used to calculate insurance rates. The calculations are implemented in a C/C++ module that is used in other software packages as well. To make it available as a web service, a Java wrapper was implemented that exposes an XML-based interface and calls the C/C++ module via JNI.

Measurements showed that several seconds were spent inside the Java part on each calculation. So my first recommendation was to enable garbage collection logging in the VM. We could see at once that many stop-the-world full GCs were being performed. When we discussed this, the developer of the Java part told us they called System.gc() on several occasions "to make sure the memory is released after use".

OK, I won't elaborate on that statement any further... ;-)

We then added the above-mentioned -XX:+DisableExplicitGC to the VM's arguments and reran the tests. This gained about 5 seconds per calculation.
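For anyone who wants to reproduce this kind of check without reading GC logs, here is a minimal sketch that counts how many collections the JVM reports around a single System.gc() call. The class and method names (ExplicitGcProbe, totalCollections, collectionsTriggeredBySystemGc) are made up for illustration; only the java.lang.management calls are standard API. I would expect the delta to be 0 when the flag is set and at least 1 otherwise, but an ordinary collection could happen to run in between, so treat the number as a hint rather than proof.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the application discussed above.
final class ExplicitGcProbe {

    /** Sums the collection counts reported by all garbage collectors of this JVM. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 means "undefined" for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Calls System.gc() once and returns how many collections were reported around it. */
    static long collectionsTriggeredBySystemGc() {
        long before = totalCollections();
        System.gc(); // expected to do nothing when -XX:+DisableExplicitGC is set
        return totalCollections() - before;
    }

    public static void main(String[] args) {
        System.out.println("Collections reported around System.gc(): "
                + collectionsTriggeredBySystemGc());
    }
}
```

Running it once with and once without -XX:+DisableExplicitGC on the command line should show the difference.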

Since we cannot change the code by stripping all those System.gc() calls at this point in our release process, we are thinking about adding -XX:+DisableExplicitGC in production until a new Jar can be created.

Now the question is: could there be any risk in doing so? About the only thing I can think of is Tomcat using System.gc() internally when redeploying, but that's just a guess. Are there any other hazards ahead?


Solution 1:

You are not alone in fixing stop-the-world GC events by setting the -XX:+DisableExplicitGC flag. Unfortunately (and in spite of the disclaimers in the documentation), many developers decide they know better than the JVM when to collect memory and introduce exactly this type of issue.

I'm aware of many instances where setting -XX:+DisableExplicitGC improved the production environment, and zero instances where it had any negative side effects.

The safe thing to do is to run your current production code, under load, with that flag set in a stress test environment and perform a normal QA cycle.
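During such a test run it is worth confirming that the flag actually reached the JVM, since start scripts and app server configs sometimes drop arguments. A minimal sketch, assuming a HotSpot-based JVM: the class name ExplicitGcFlagCheck is hypothetical, and HotSpotDiagnosticMXBean lives in com.sun.management, so it will probably not be available on other VMs such as IBM J9.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper; assumes a HotSpot-based JVM.
final class ExplicitGcFlagCheck {

    /** Returns the current value of the DisableExplicitGC VM option as a string. */
    static String disableExplicitGcValue() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption("DisableExplicitGC").getValue();
    }

    public static void main(String[] args) {
        System.out.println("DisableExplicitGC = " + disableExplicitGcValue());
    }
}
```

Logging this value once at startup makes it easy to tell afterwards which test runs had the flag in effect.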

If you cannot do that, I would suggest that the risk of setting the flag is less than the cost of not setting it in most cases.

Solution 2:

I've been wrestling with this same issue, and based on all the information I've been able to find, there definitely appears to be some risk. Per the comments on your original post from @millimoose, as well as https://bugs.openjdk.java.net/browse/JDK-6200079 , it appears that setting -XX:+DisableExplicitGC would be a bad idea if NIO direct buffers are being used. They appear to be used in the internal implementation of the WebSphere 8.5 app server, which we're using. Here's the thread dump entry I was able to capture while debugging this:

3XMTHREADINFO      "WebContainer : 25" J9VMThread:0x0000000006FC5D00, j9thread_t:0x00007F60E41753E0, java/lang/Thread:0x000000060B735590, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0xFE, isDaemon:true)
3XMTHREADINFO1            (native thread ID:0x1039, native priority:0x5, native policy:UNKNOWN)
3XMTHREADINFO2            (native stack address range from:0x00007F6067621000, to:0x00007F6067662000, size:0x41000)
3XMCPUTIME               CPU usage total: 80.222215853 secs
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=1594568 (0x1854C8)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/lang/System.gc(System.java:329)
4XESTACKTRACE                at java/nio/Bits.syncReserveMemory(Bits.java:721)
5XESTACKTRACE                   (entered lock: java/nio/Bits@0x000000060000B690, entry count: 1)
4XESTACKTRACE                at java/nio/Bits.reserveMemory(Bits.java:766(Compiled Code))
4XESTACKTRACE                at java/nio/DirectByteBuffer.<init>(DirectByteBuffer.java:123(Compiled Code))
4XESTACKTRACE                at java/nio/ByteBuffer.allocateDirect(ByteBuffer.java:306(Compiled Code))
4XESTACKTRACE                at com/ibm/ws/buffermgmt/impl/WsByteBufferPoolManagerImpl.allocateBufferDirect(WsByteBufferPoolManagerImpl.java:706(Compiled Code))
4XESTACKTRACE                at com/ibm/ws/buffermgmt/impl/WsByteBufferPoolManagerImpl.allocateCommon(WsByteBufferPoolManagerImpl.java:612(Compiled Code))
4XESTACKTRACE                at com/ibm/ws/buffermgmt/impl/WsByteBufferPoolManagerImpl.allocateDirect(WsByteBufferPoolManagerImpl.java:527(Compiled Code))
4XESTACKTRACE                at com/ibm/io/async/ResultHandler.runEventProcessingLoop(ResultHandler.java:507(Compiled Code))
4XESTACKTRACE                at com/ibm/io/async/ResultHandler$2.run(ResultHandler.java:905(Compiled Code))
4XESTACKTRACE                at com/ibm/ws/util/ThreadPool$Worker.run(ThreadPool.java:1864(Compiled Code))
3XMTHREADINFO3           Native callstack:
4XENATIVESTACK               (0x00007F61083DD122 [libj9prt26.so+0x13122])
4XENATIVESTACK               (0x00007F61083EA79F [libj9prt26.so+0x2079f])
....

Exactly what the full ramifications are of setting -XX:+DisableExplicitGC when NIO direct byte buffers are being used isn't entirely clear to me yet (does this introduce a memory leak?), but there does appear to be at least some risk. If you're using an app server other than WebSphere, you may want to verify that the app server itself isn't invoking System.gc() via NIO before disabling it. I've posted a related question that will hopefully get some clarification on the exact impact on the NIO libraries: "Impact of setting -XX:+DisableExplicitGC when NIO direct buffers are used".
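To get a feel for whether direct buffers matter in a given server, one option is to watch the JVM's "direct" buffer pool. The sketch below is only an illustration: the class name DirectBufferUsage and the 1 MiB allocation are my own choices, and the pool name "direct" is what HotSpot-based JVMs report, which I have not verified on J9. As far as I understand the stack trace above, the System.gc() call in java.nio.Bits is a last attempt to free native memory held by unreachable direct buffers when the reservation limit is hit, so with explicit GC disabled a steadily growing number here would be a warning sign.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper for observing direct buffer memory.
final class DirectBufferUsage {

    /** Returns the bytes used by the "direct" buffer pool, or -1 if no such pool is reported. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Native memory for this buffer is only released after the buffer object is collected.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
        long after = directMemoryUsed();
        System.out.println("direct pool bytes before=" + before
                + " after=" + after + " capacity=" + buffer.capacity());
    }
}
```

The same BufferPoolMXBean is exposed over JMX, so the value can also be watched from a monitoring tool during a load test instead of from code.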

Incidentally, WebSphere also seems to invoke System.gc() manually several times during the boot process, usually twice within the first couple of seconds after the app server is launched, and a third time within the first 1-2 minutes (possibly when the application is being deployed). In our case, this is why we started investigating in the first place, as it appears that all the System.gc() calls are coming directly from the app server, and never from our application code.

It should also be noted that in addition to the NIO libraries, the JDK internal implementation of RMI distributed garbage collection also calls System.gc(); see "Unexplained System.gc() calls due to Remote Method Invocation" and "System.gc() calls by core APIs".

Whether enabling -XX:+DisableExplicitGC will wreak havoc with RMI DGC as well is a little unclear to me. The only reference I've been able to find that even addresses this is the first reference above, which states

"However, in most cases regular GC activity is sufficient for effective DGC"

That 'in most cases' qualifier sounds awfully wishy-washy to me, so again, it seems like there's at least some risk in just shutting off all System.gc() calls. You'd be better off fixing the calls in your code if at all possible, and only shutting them off entirely as a last resort.