Ways to improve performance consistency

I'm not an expert in the area of processor caches but I suspect your issue is essentially a cache issue or some other memory layout problem. Repeated allocation of the buffers and counters without cleaning up the old objects may be causing you to periodically get a very bad cache layout, which may lead to your inconsistent performance.

Using your code with a few modifications, I have been able to make the performance consistent (my test machine is an Intel Core2 Quad CPU Q6600 2.4GHz w/ Win7x64, so not quite the same but hopefully close enough to give relevant results). I've done this in two different ways, both of which have roughly the same effect.

First, move creation of the buffers and counters outside of the doTest method so that they are created only once and then reused for each pass of the test. Now you get the one allocation, it sits nicely in the cache and performance is consistent.
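A minimal sketch of that first change, in case it helps. The buffer type, the size, and the body of performTiming are assumptions made purely for illustration (I've used long[] buffers and AtomicLong counters); only the shape matters: allocate once, outside doTest, and reset state between passes instead of reallocating.

```java
import java.util.concurrent.atomic.AtomicLong;

class ReuseBuffersSketch {
    // Hypothetical size; the real test's buffer size and type may differ.
    static final int BUFFER_SIZE = 1024;

    // Allocated exactly once, outside doTest, and reused by every pass.
    static final long[] writeBuffer = new long[BUFFER_SIZE];
    static final long[] readBuffer = new long[BUFFER_SIZE];
    static final AtomicLong readCount = new AtomicLong();
    static final AtomicLong writeCount = new AtomicLong();

    // Stand-in for the real timing pass: copies values through the buffers
    // and bumps the counters, so there is something to run.
    static long performTiming(long[] wb, long[] rb, AtomicLong rc, AtomicLong wc) {
        long sum = 0;
        for (int i = 0; i < wb.length; i++) {
            wb[i] = i;
            wc.incrementAndGet();
            rb[i] = wb[i];
            rc.incrementAndGet();
            sum += rb[i];
        }
        return sum;
    }

    static void doTest() {
        // Reset the counters rather than allocating new ones.
        readCount.set(0);
        writeCount.set(0);
        for (int i = 0; i < 3; i++) {
            performTiming(writeBuffer, readBuffer, readCount, writeCount);
        }
    }

    public static void main(String[] args) {
        for (int pass = 0; pass < 5; pass++) {
            doTest();
        }
        System.out.println("writeCount after last pass = " + writeCount.get());
    }
}
```

Whether the reused memory ends up in a cache-friendly location is still up to the JVM and the hardware, so treat this as a way to remove one variable rather than a guaranteed fix.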

Another way to get the same reuse but with "different" buffers/counters was to insert a gc after the performTiming loop:

for ( int i = 0; i < 3; i++ )
    performTiming ( writeBuffer, readBuffer, readCount, writeCount );
System.out.println ();
System.gc ();

Here the result is more or less the same: the gc lets the buffers/counters be reclaimed, the next allocation ends up reusing the same memory (at least on my test system), and you end up in cache with consistent performance. I also added printing of the actual addresses to verify reuse of the same locations.

My guess is that without the cleanup leading to reuse, you eventually end up with a buffer allocated that doesn't fit into the cache, and your performance suffers as it is swapped in. I suspect that you could also do some strange things with the order of allocation (for example, you can make the performance worse on my machine by moving the counter allocation in front of the buffers), or create some dead space around each run to "purge" the cache if you didn't want to eliminate the buffers from a prior loop.

Finally, as I said, processor cache and the fun of memory layouts aren't my area of expertise so if the explanations are misleading or wrong - sorry about that.


You are busy-waiting. That is always a bad idea in user code.

reader:

while ((toRead = writeCount.get() - rc) <= 0) ;

writer:

while (wc - readCount.get() > 0) ;
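One possible alternative to the bare spin, sketched under assumptions: the class name, the awaitAvailable helper, and the spin/yield limits are all hypothetical, and the counter is assumed to be an AtomicLong as in the loops above. The idea is to spin briefly, then yield, then park for a short time, so a waiting thread stops burning a full core. This trades some latency for lower CPU use, so it may well change the throughput numbers you are measuring.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

class BackoffWaitSketch {
    // Hypothetical tuning values; pick them by measurement.
    static final int SPIN_LIMIT = 100;
    static final int YIELD_LIMIT = 200;

    // Waits until (counter.get() - consumed) is positive and returns that
    // difference, backing off progressively instead of spinning flat out.
    static long awaitAvailable(AtomicLong counter, long consumed) {
        int idle = 0;
        long available;
        while ((available = counter.get() - consumed) <= 0) {
            if (idle < SPIN_LIMIT) {
                idle++;                         // short pure spin
            } else if (idle < YIELD_LIMIT) {
                idle++;
                Thread.yield();                 // let other threads run
            } else {
                LockSupport.parkNanos(1_000L);  // sleep briefly
            }
        }
        return available;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong writeCount = new AtomicLong();
        Thread writer = new Thread(() -> {
            LockSupport.parkNanos(1_000_000L);  // simulate a slow writer
            writeCount.set(5);
        });
        writer.start();
        long toRead = awaitAvailable(writeCount, 0);
        writer.join();
        System.out.println("toRead = " + toRead);
    }
}
```

The reader loop would then become something like toRead = awaitAvailable(writeCount, rc); the writer side would need the mirror-image condition.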

As a general approach to performance analysis:

  • Try jconsole. Start your app, and while it's running type jconsole in a separate terminal window. This will bring up the Java Console GUI, which allows you to connect to a running JVM and see performance metrics, memory usage, thread count and status, etc.
  • Basically you're going to have to figure out the correlation between the speed fluctuations and what you see the JVM doing. It could also be helpful to bring up your task manager, put it side-by-side with the jconsole window, and see whether your system is actually just busy doing other stuff (paging to disk due to low memory, busy with a heavy background task, etc.).
  • One other alternative is launching the JVM with the -Xprof option, which outputs the relative time spent in various methods on a per-thread basis, e.g. java -Xprof [your class file]. Note that this option was removed in JDK 10, so it only works on older JVMs.
  • Finally, there is also JProfiler, but it's a commercial tool, if that matters to you.
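If you'd rather correlate in code than by eye, here is a rough sketch that samples the JVM's own GC counters around each timed pass using the standard java.lang.management beans. The class name, the timedPass helper, and the dummy workload in main are made up for illustration; in practice you would wrap your real test pass in the Runnable.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcCorrelationSketch {
    // Sum of collection counts across all collectors (-1 means "undefined").
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count;
        }
        return total;
    }

    // Sum of approximate accumulated collection time in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) total += millis;
        }
        return total;
    }

    // Runs one pass and returns {elapsedNanos, gcCountDelta, gcTimeDeltaMillis}.
    static long[] timedPass(Runnable pass) {
        long gcCountBefore = totalGcCount();
        long gcTimeBefore = totalGcTimeMillis();
        long start = System.nanoTime();
        pass.run();
        long elapsed = System.nanoTime() - start;
        return new long[] {
            elapsed,
            totalGcCount() - gcCountBefore,
            totalGcTimeMillis() - gcTimeBefore
        };
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            // Dummy workload that allocates garbage; replace with the real test pass.
            long[] r = timedPass(() -> {
                long sink = 0;
                for (int j = 0; j < 100_000; j++) {
                    sink += new byte[256].length;
                }
                if (sink == 0) System.out.println("unreachable");
            });
            System.out.println("pass " + i + ": " + r[0] / 1_000_000.0 + " ms, GCs="
                    + r[1] + ", GC time=" + r[2] + " ms");
        }
    }
}
```

If the slow passes line up with non-zero GC deltas, that points at the collector; if they don't, the cause is more likely scheduling, caches, or something outside the JVM.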

EDIT: It appears that triggering a GC will shift the behaviour. These show repeated tests on the same buffer+counters with a manually triggered GC halfway through.

GC means reaching a safepoint, which means all threads have stopped executing bytecode and the GC threads have work to do. This can have various side effects. For example, in the absence of any explicit CPU affinity, your threads may resume execution on a different core, or cache lines may have been refreshed. Can you track which cores your threads are running on?

What CPUs are these? Have you done anything about power management to prevent them dropping down into lower P-states and/or C-states? Perhaps one thread is being scheduled onto a core that was in a different P-state and hence shows a different performance profile.

EDIT

I tried running your test on a workstation running x64 Linux with 2 slightly old quad-core Xeons (E5504). It's generally consistent within a run (~17-18M/s), with occasional runs that are much slower, which appears to generally correspond with thread migrations. I didn't plot this rigorously. It therefore appears your problem might be CPU architecture specific.

You mention you're running an i7 at 4.6GHz; is that a typo? I thought the i7 topped out at 3.5GHz with a 3.9GHz turbo mode (with an earlier version at 3.3GHz with a 3.6GHz turbo). Either way, are you sure you're not seeing an artifact of turbo mode kicking in and then dropping out? You could try repeating the test with turbo disabled to be sure.

A couple of other points:

  • The padding values are all 0. Are you sure there isn't some special treatment being given to uninitialised values? You could consider using the LogCompilation option (-XX:+LogCompilation, a diagnostic option that also needs -XX:+UnlockDiagnosticVMOptions) to understand how the JIT is treating that method.
  • Intel VTune is free for a 30 day evaluation. If this is a cache line problem, you could use it to determine what the problem is on your host.