Weird performance increase in simple benchmark

There is a very simple way to always get the "fast" version of your program: Project > Properties > Build tab, untick the "Prefer 32-bit" option and ensure that the Platform target selection is AnyCPU.
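
To confirm which mode the process actually ended up in after changing the option, you can ask the runtime at startup. This is only a minimal sketch: `Environment.Is64BitProcess` and `IntPtr.Size` are standard base class library members, but the `BitnessCheck` class and its `Describe` method are names made up for illustration.

```csharp
using System;

// Hypothetical helper, not part of the original benchmark.
public static class BitnessCheck
{
    // Describes the mode the current process runs in.
    public static string Describe()
    {
        // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process.
        return string.Format("{0}-bit process, pointer size {1} bytes",
            Environment.Is64BitProcess ? 64 : 32, IntPtr.Size);
    }
}
```

Calling `Console.WriteLine(BitnessCheck.Describe());` at the top of Main shows whether the "Prefer 32-bit" change took effect.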

You really don't prefer 32-bit; unfortunately it is always turned on by default for C# projects. Historically, the Visual Studio toolset worked much better with 32-bit processes, an old problem that Microsoft has been chipping away at. Time to get that option removed; VS2015 in particular addressed the last few real roadblocks to 64-bit code with a brand-new x64 jitter and universal support for Edit+Continue.

Enough chatter, what you discovered is the importance of alignment for variables. The processor cares about it a great deal. If a variable is misaligned in memory then the processor has to do extra work to shuffle the bytes into the right order. There are two distinct misalignment problems. One is where the bytes are still inside a single L1 cache line; that costs an extra cycle to shift them into the right position. And the extra bad one, the one you found, is where part of the bytes are in one cache line and part in another. That requires two separate memory accesses and gluing them together. Three times as slow.
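
The two cases can be told apart with plain address arithmetic. The sketch below is illustrative only: `AlignmentMath` and its members are hypothetical names, and the 64-byte cache line size is an assumption that holds for common x86 processors but is not stated in the original post.

```csharp
// Hypothetical helper for illustration; not taken from the benchmark.
public static class AlignmentMath
{
    // Assumed cache line size in bytes.
    public const int CacheLineSize = 64;

    // True when address is a multiple of alignment (alignment must be a power of two).
    public static bool IsAligned(long address, int alignment)
    {
        return (address & (alignment - 1)) == 0;
    }

    // True when a value of the given size, starting at address,
    // has bytes in two different cache lines.
    public static bool StraddlesCacheLine(long address, int size)
    {
        long firstLine = address / CacheLineSize;
        long lastLine = (address + size - 1) / CacheLineSize;
        return firstLine != lastLine;
    }
}
```

With these definitions, an 8-byte double at address 0x1004 is misaligned but stays inside one cache line (the cheap case), while one at 0x103C is misaligned and also straddles two lines (the expensive case). At 0x1040 it is aligned and cannot straddle.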

The double and long types are the trouble-makers in a 32-bit process. They are 64 bits in size and can thus get misaligned by 4, since the CLR can only guarantee 32-bit alignment. Not a problem in a 64-bit process, all variables are guaranteed to be aligned to 8. This is also the underlying reason why the C# language cannot promise that they are atomic, and why arrays of double are allocated in the Large Object Heap when they have more than 1000 elements: the LOH provides an alignment guarantee of 8. It also explains why adding a local variable solved the problem: an object reference is 4 bytes, so it moved the double variable by 4, now getting it aligned. By accident.
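
Because plain reads and writes of a 64-bit field are not promised to be atomic, code that shares a long between threads in a 32-bit process normally goes through `System.Threading.Interlocked`. A minimal sketch, assuming a made-up `Counter` class; only `Interlocked.Add` and `Interlocked.Read` are real framework calls:

```csharp
using System.Threading;

// Hypothetical class for illustration.
public class Counter
{
    // 64-bit field: a plain read or write may be split in two in a 32-bit process.
    private long total;

    public void Add(long amount)
    {
        // Atomic read-modify-write of the 64-bit value.
        Interlocked.Add(ref total, amount);
    }

    public long Read()
    {
        // Atomic 64-bit read, also in a 32-bit process.
        return Interlocked.Read(ref total);
    }
}
```

This addresses atomicity only. It does not change where the field lands in memory, so the alignment cost described above can still apply.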

A 32-bit C or C++ compiler does extra work to ensure that double cannot be misaligned. Not exactly a simple problem to solve: the stack can be misaligned when a function is entered, given that the only guarantee is that it is aligned to 4. The prologue of such a function needs to do extra work to get it aligned to 8. The same trick doesn't work in a managed program, because the garbage collector cares a great deal about where exactly a local variable is located in memory. That is necessary so it can discover that an object in the GC heap is still referenced. It cannot deal properly with such a variable getting moved by 4 because the stack was misaligned when the method was entered.
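
The extra work in such a prologue amounts to rounding the stack pointer down to the next multiple of 8, typically by masking off the low bits. The arithmetic can be sketched as follows; `StackAlign.AlignDown` is a hypothetical name and the addresses are invented example values, not taken from any real compiler output.

```csharp
// Hypothetical helper that mimics the rounding a native prologue performs.
public static class StackAlign
{
    // Rounds address down to a multiple of alignment (alignment must be a power of two).
    public static long AlignDown(long address, int alignment)
    {
        return address & ~((long)alignment - 1);
    }
}
```

For example, `StackAlign.AlignDown(0x0012FF7C, 8)` yields 0x0012FF78, so locals move by 4 bytes, while an already aligned 0x0012FF78 is left unchanged. That data-dependent shift is exactly the kind of movement the garbage collector cannot tolerate for managed locals.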

This is also the underlying problem with .NET jitters not easily supporting SIMD instructions. Those instructions have much stronger alignment requirements, the kind that the processor cannot solve by itself either. SSE2 requires an alignment of 16, AVX requires an alignment of 32. Can't get that in managed code.

Last but not least, also note that this makes the perf of a C# program that runs in 32-bit mode very unpredictable. When you access a double or long that's stored as a field in an object, perf can change drastically when the garbage collector compacts the heap. That moves objects in memory, so such a field can suddenly become misaligned, or aligned. Very random of course, can be quite a head-scratcher :)

Well, no simple fixes but one: 64-bit code is the future. Remove the jitter forcing yourself for as long as Microsoft won't change the project template. Maybe in the next version, when they feel more confident about RyuJIT.

Update 4 explains the problem: in the first case, the JIT keeps the calculated values (a, b) on the stack; in the second case, the JIT keeps them in registers.

In fact, Test1 works slowly because of the Stopwatch. I wrote the following minimal benchmark based on BenchmarkDotNet:

using System.Diagnostics; // Stopwatch

[BenchmarkTask(platform: BenchmarkPlatform.X86)]
public class Jit_RegistersVsStack
{
    private const int IterationCount = 100001;

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithoutStopwatch()
    {
        double a = 1, b = 1;
        for (int i = 0; i < IterationCount; i++)
        {
            // JIT-x86 output: a stays on the x87 register stack
            // fld1
            // faddp       st(1),st
            a = a + b;
        }
        return string.Format("{0}", a);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithStopwatch()
    {
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // JIT-x86 output: a is read from and written back to the stack frame each iteration
            // fld1
            // fadd        qword ptr [ebp-14h]
            // fstp        qword ptr [ebp-14h]
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithTwoStopwatches()
    {
        var outerSw = new Stopwatch();
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // JIT-x86 output: a stays on the x87 register stack again
            // fld1
            // faddp       st(1),st
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }
}

The results on my computer:

BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit  [RyuJIT]
Type=Jit_RegistersVsStack  Mode=Throughput  Platform=X86  Jit=HostJit  .NET=HostFramework

             Method |   AvrTime |    StdDev |       op/s |
------------------- |---------- |---------- |----------- |
   WithoutStopwatch | 1.0333 ns | 0.0028 ns | 967,773.78 |
      WithStopwatch | 3.4453 ns | 0.0492 ns | 290,247.33 |
 WithTwoStopwatches | 1.0435 ns | 0.0341 ns | 958,302.81 |

As we can see:

WithoutStopwatch works quickly (because a = a + b uses the registers)
WithStopwatch works slowly (because a = a + b uses the stack)
WithTwoStopwatches works quickly again (because a = a + b uses the registers)

The behavior of JIT-x86 depends on a large number of different conditions. For some reason, the first stopwatch forces JIT-x86 to use the stack, and the second stopwatch allows it to use the registers again.
