How serious is the Java7 "Solr/Lucene" bug?

Apparently Java7 has some nasty bug regarding loop optimization: Google search.

From the reports and bug descriptions I find it hard to judge how significant this bug is (unless you use Solr or Lucene).

What I'd like to know:

  • How likely is it that my (any) program is affected?
  • Is the bug deterministic enough that normal testing will catch it?

Note: I can't make users of my program use -XX:-UseLoopPredicate to avoid the problem.


The problem with any HotSpot bug is that you need to reach the compilation threshold (e.g. 10000) before it can get you, so if your unit tests are "trivial", you probably won't catch it.
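
To illustrate what a non-trivial test could look like, here is a minimal sketch that calls a hot loop well past the compilation threshold and compares every result with the one from the first call, which is most likely still interpreted. The class and method names are hypothetical, `sumDeltas` is only a stand-in for your own loop, and 20,000 iterations is an assumption borrowed from the figures in this thread rather than a guaranteed trigger.

```java
// Hypothetical warm-up harness; sumDeltas stands in for your own hot loop.
final class WarmupCheck {

    // The loop under test: decodes delta-encoded values and sums the decoded values.
    static long sumDeltas(int[] deltas) {
        long doc = 0;
        long total = 0;
        for (int i = 0; i < deltas.length; i++) {
            doc += deltas[i];
            total += doc;
        }
        return total;
    }

    // Calls the loop `iterations` times and counts the calls whose result
    // differs from the result of the very first call.
    static int countMismatches(int[] deltas, int iterations) {
        long expected = sumDeltas(deltas);
        int mismatches = 0;
        for (int i = 0; i < iterations; i++) {
            if (sumDeltas(deltas) != expected) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        int[] deltas = new int[1000];
        for (int i = 0; i < deltas.length; i++) {
            deltas[i] = (i % 7) + 1;
        }
        // Assumption: 20,000 calls is enough to get the method JIT-compiled.
        System.out.println("mismatches: " + countMismatches(deltas, 20000));
    }
}
```

A clean run proves nothing about other loops or other configurations; it only shows that this particular loop gives the same answer before and after compilation.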

For example, we caught the incorrect-results issue in Lucene because this particular test creates 20,000 document indexes.

In our tests we randomize different interfaces (e.g. different Directory implementations), indexing parameters and so on, and the test only fails 1% of the time; of course, it's then reproducible with the same random seed. We also run checkindex on every index that the tests create, which does some sanity checks to ensure the index is not corrupt.
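
The seed idea can be sketched in a few lines. This is not Lucene's actual test framework: the `test.seed` property name, the class name and the configuration options are assumptions made for illustration, and the directory names are used only as labels.

```java
import java.util.Random;

// Hypothetical sketch of a randomized test setup that can be replayed from its seed.
final class SeededRandomTest {

    // Reuses -Dtest.seed=<n> if given (assumed property name), else picks a fresh seed.
    static long chooseSeed() {
        String prop = System.getProperty("test.seed");
        return prop != null ? Long.parseLong(prop) : System.nanoTime();
    }

    // Derives a test configuration from the seed; the same seed always
    // yields the same configuration.
    static String pickConfig(long seed) {
        Random random = new Random(seed);
        String[] directories = {"RAMDirectory", "FSDirectory"}; // labels only
        String directory = directories[random.nextInt(directories.length)];
        boolean payloads = random.nextBoolean();
        int numDocs = 1000 + random.nextInt(20000);
        return directory + ", payloads=" + payloads + ", numDocs=" + numDocs;
    }

    public static void main(String[] args) {
        long seed = chooseSeed();
        // Always log the seed so that a rare failure can be replayed.
        System.out.println("seed=" + seed + " config=" + pickConfig(seed));
    }
}
```

When a run fails, passing the logged seed back in via `-Dtest.seed=...` reproduces the same configuration.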

For the test we found, if you have a particular configuration (e.g. RAMDirectory + PulsingCodec + payloads stored for the field), then after it hits the compilation threshold, the enumeration loop over the postings returns incorrect calculations; in this case the number of returned documents for a term != the docFreq stored for the term.

We have a good number of stress tests, and it's important to note that the normal assertions in this test actually pass; it's the checkindex part at the end that fails.
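
The kind of cross-check that catches this can be sketched as follows. This is not the real checkindex code and uses no Lucene API: `TermEntry` and the other names are hypothetical stand-ins, and the only invariant checked is the one mentioned above, namely that the number of documents enumerated for a term equals the docFreq stored for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index sanity check; not Lucene's checkindex.
final class PostingsSanityCheck {

    // One term's entry: the count recorded at write time plus the postings themselves.
    static final class TermEntry {
        final int docFreq;      // document count stored for the term
        final int[] docDeltas;  // delta-encoded document ids

        TermEntry(int docFreq, int[] docDeltas) {
            this.docFreq = docFreq;
            this.docDeltas = docDeltas;
        }
    }

    // Enumerates the postings the way a reader would and counts the documents seen.
    static int countDocs(int[] docDeltas) {
        int count = 0;
        int doc = 0;
        for (int delta : docDeltas) {
            doc += delta; // decode the next document id
            count++;
        }
        return count;
    }

    // Returns the terms whose enumerated document count disagrees with the stored docFreq.
    static List<String> findInconsistentTerms(Map<String, TermEntry> index) {
        List<String> inconsistent = new ArrayList<String>();
        for (Map.Entry<String, TermEntry> e : index.entrySet()) {
            if (countDocs(e.getValue().docDeltas) != e.getValue().docFreq) {
                inconsistent.add(e.getKey());
            }
        }
        return inconsistent;
    }
}
```

The point of such a check is that it is independent of the test's own assertions, so it can fail even when they all pass.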

The big problem with this is that Lucene's incremental indexing fundamentally works by merging multiple segments into one. Because of this, if these enums calculate invalid data, that invalid data is then stored into the newly merged index: aka corruption.

I'd say this bug is much sneakier than previous loop optimizer hotspot bugs we have hit (e.g. sign-flip stuff, https://issues.apache.org/jira/browse/LUCENE-2975). In that case we got wacky negative document deltas, which make it easy to catch. We also only had to manually unroll a single method to dodge it. On the other hand, the only "test" we had initially for that was a huge 10GB index of http://www.pangaea.de/, so it was painful to narrow it down to this bug.

In this case, I spent a good amount of time (e.g. every night last week) trying to manually unroll/inline various things, trying to create some workaround so we could dodge the bug and not have the possibility of corrupt indexes being created. I could dodge some cases, but there were many more cases I couldn't... and I'm sure if we can trigger this stuff in our tests there are more cases out there...


A simple way to reproduce the bug: open Eclipse (Indigo in my case) and go to Help/Search. Enter a search string and you will notice that Eclipse crashes. Have a look at the log:

# Problematic frame:
# J  org.apache.lucene.analysis.PorterStemmer.stem([CII)Z
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x0000000007b79000):  JavaThread "Worker-46" [_thread_in_Java, id=264, stack(0x000000000f380000,0x000000000f480000)]

siginfo: ExceptionCode=0xc0000005, reading address 0x00000002f62bd80e

Registers:

The problem still exists as of Dec 2, 2012 in both the Oracle JDK

java -version
java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

and OpenJDK

java version "1.7.0_09-icedtea"
OpenJDK Runtime Environment (fedora-2.3.3.fc17.1-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)

Strangely, either of the options -XX:-UseLoopPredicate or -XX:LoopUnrollLimit=1 on its own prevents the bug from happening, but when they are used together the JDK fails; see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=849279
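
If you cannot control how users launch your program, one option is to detect the situation at startup and at least warn. The following is a rough sketch under stated assumptions: the class name is hypothetical, every 1.7 runtime is treated as suspect (which is almost certainly too broad), and the check only looks for the two flags named above in the JVM's input arguments.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical startup check; the policy "warn on any 1.7 runtime" is an assumption.
final class LoopOptimizerFlagCheck {

    // Returns true when running on a 1.7 runtime with neither workaround flag present.
    static boolean shouldWarn(String javaVersion, List<String> jvmArgs) {
        if (javaVersion == null || !javaVersion.startsWith("1.7")) {
            return false;
        }
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:-UseLoopPredicate") || arg.equals("-XX:LoopUnrollLimit=1")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        // The arguments passed to the JVM itself, not the program arguments.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (shouldWarn(version, jvmArgs)) {
            System.err.println("Warning: Java " + version
                    + " without a loop optimizer workaround flag; indexes may be at risk.");
        }
    }
}
```

This does not fix anything; it only makes the risk visible to whoever reads the log.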


Well it's two years later and I believe this bug (or a variation of it) is still present in 1.7.0_25-b15 on OSX.

Through very painful trial and error I have determined that using Java 1.7 with Solr 3.6.2 and autocommit <maxTime>30000</maxTime> seems to cause index corruption. It only seems to happen with Java 1.7 and maxTime at 30000: if I switch to Java 1.6, I have no problems, and if I lower maxTime to 3000, I have no problems either.

The JVM does not crash, but it causes RSolr to die with the following stack trace in Ruby: https://gist.github.com/armhold/6354416. It does this reliably after saving a few hundred records.

Given the many layers involved here (Ruby, Sunspot, RSolr, etc.) I'm not sure I can boil this down into something that definitively proves a JVM bug, but it sure feels like that's what's happening here. FWIW I have also tried JDK 1.7.0_04, and it also exhibits the problem.