One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the JIT compiler does register scheduling in the corresponding x86 code. The result is suboptimal code generation on the loads and stores of the locals.

For some reason unclear to all of us, the problematic code generation path is avoided when the JITter knows that the block is in a try-protected region.

This is pretty weird. We'll follow up with the JITter team and see whether we can get a bug entered so that they can fix this.

Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe that the JITter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.

Thanks for bringing this to our attention, and apologies for the odd behaviour.


Well, the way you're timing things looks pretty nasty to me. It would be much more sensible to just time the whole loop:

var stopwatch = Stopwatch.StartNew();
for (int i = 1; i < 100000000; i++)
{
    Fibo(100);
}
stopwatch.Stop();
Console.WriteLine("Elapsed time: {0}", stopwatch.Elapsed);

That way you're not at the mercy of tiny timings, floating point arithmetic and accumulated error.

Having made that change, see whether the "non-catch" version is still slower than the "catch" version.

EDIT: Okay, I've tried it myself - and I'm seeing the same result. Very odd. I wondered whether the try/catch was disabling some bad inlining, but using [MethodImpl(MethodImplOptions.NoInlining)] instead didn't help...

Basically you'll need to look at the optimized JITted code under cordbg, I suspect...

EDIT: A few more bits of information:

  • Putting the try/catch around just the n++; line still improves performance, but not by as much as putting it around the whole block
  • If you catch a specific exception (ArgumentException in my tests) it's still fast
  • If you print the exception in the catch block it's still fast
  • If you rethrow the exception in the catch block it's slow again
  • If you use a finally block instead of a catch block it's slow again
  • If you use a finally block as well as a catch block, it's fast

Weird...

EDIT: Okay, we have disassembly...

This is using the C# 2 compiler and .NET 2 (32-bit) CLR, disassembling with mdbg (as I don't have cordbg on my machine). I still see the same performance effects, even under the debugger. The fast version uses a try block around everything between the variable declarations and the return statement, with just a catch{} handler. Obviously the slow version is the same except without the try/catch. The calling code (i.e. Main) is the same in both cases, and has the same assembly representation (so it's not an inlining issue).

Disassembled code for fast version:

 [0000] push        ebp
 [0001] mov         ebp,esp
 [0003] push        edi
 [0004] push        esi
 [0005] push        ebx
 [0006] sub         esp,1Ch
 [0009] xor         eax,eax
 [000b] mov         dword ptr [ebp-20h],eax
 [000e] mov         dword ptr [ebp-1Ch],eax
 [0011] mov         dword ptr [ebp-18h],eax
 [0014] mov         dword ptr [ebp-14h],eax
 [0017] xor         eax,eax
 [0019] mov         dword ptr [ebp-18h],eax
*[001c] mov         esi,1
 [0021] xor         edi,edi
 [0023] mov         dword ptr [ebp-28h],1
 [002a] mov         dword ptr [ebp-24h],0
 [0031] inc         ecx
 [0032] mov         ebx,2
 [0037] cmp         ecx,2
 [003a] jle         00000024
 [003c] mov         eax,esi
 [003e] mov         edx,edi
 [0040] mov         esi,dword ptr [ebp-28h]
 [0043] mov         edi,dword ptr [ebp-24h]
 [0046] add         eax,dword ptr [ebp-28h]
 [0049] adc         edx,dword ptr [ebp-24h]
 [004c] mov         dword ptr [ebp-28h],eax
 [004f] mov         dword ptr [ebp-24h],edx
 [0052] inc         ebx
 [0053] cmp         ebx,ecx
 [0055] jl          FFFFFFE7
 [0057] jmp         00000007
 [0059] call        64571ACB
 [005e] mov         eax,dword ptr [ebp-28h]
 [0061] mov         edx,dword ptr [ebp-24h]
 [0064] lea         esp,[ebp-0Ch]
 [0067] pop         ebx
 [0068] pop         esi
 [0069] pop         edi
 [006a] pop         ebp
 [006b] ret

Disassembled code for slow version:

 [0000] push        ebp
 [0001] mov         ebp,esp
 [0003] push        esi
 [0004] sub         esp,18h
*[0007] mov         dword ptr [ebp-14h],1
 [000e] mov         dword ptr [ebp-10h],0
 [0015] mov         dword ptr [ebp-1Ch],1
 [001c] mov         dword ptr [ebp-18h],0
 [0023] inc         ecx
 [0024] mov         esi,2
 [0029] cmp         ecx,2
 [002c] jle         00000031
 [002e] mov         eax,dword ptr [ebp-14h]
 [0031] mov         edx,dword ptr [ebp-10h]
 [0034] mov         dword ptr [ebp-0Ch],eax
 [0037] mov         dword ptr [ebp-8],edx
 [003a] mov         eax,dword ptr [ebp-1Ch]
 [003d] mov         edx,dword ptr [ebp-18h]
 [0040] mov         dword ptr [ebp-14h],eax
 [0043] mov         dword ptr [ebp-10h],edx
 [0046] mov         eax,dword ptr [ebp-0Ch]
 [0049] mov         edx,dword ptr [ebp-8]
 [004c] add         eax,dword ptr [ebp-1Ch]
 [004f] adc         edx,dword ptr [ebp-18h]
 [0052] mov         dword ptr [ebp-1Ch],eax
 [0055] mov         dword ptr [ebp-18h],edx
 [0058] inc         esi
 [0059] cmp         esi,ecx
 [005b] jl          FFFFFFD3
 [005d] mov         eax,dword ptr [ebp-1Ch]
 [0060] mov         edx,dword ptr [ebp-18h]
 [0063] lea         esp,[ebp-4]
 [0066] pop         esi
 [0067] pop         ebp
 [0068] ret

In each case the * shows where the debugger entered in a simple "step-into".

EDIT: Okay, I've now looked through the code and I think I can see how each version works... and I believe the slower version is slower because it uses fewer registers and more stack space. For small values of n that's possibly faster - but when the loop takes up the bulk of the time, it's slower.

Possibly the try/catch block forces more registers to be saved and restored, so the JIT uses those for the loop as well... which happens to improve the performance overall. It's not clear whether it's a reasonable decision for the JIT to not use as many registers in the "normal" code.

EDIT: Just tried this on my x64 machine. The x64 CLR is much faster (about 3-4 times faster) than the x86 CLR on this code, and under x64 the try/catch block doesn't make a noticeable difference.


Jon's disassemblies show, that the difference between the two versions is that the fast version uses a pair of registers (esi,edi) to store one of the local variables where the slow version doesn't.

The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs. code which doesn't. This causes it to make different register allocation choices. In this case, this favors the code with the try-catch block. Different code may lead to the opposite effect, so I would not count this as a general-purpose speed-up technique.

In the end, it's very hard to tell which code will end up running the fastest. Something like register allocation and the factors that influence it are such low-level implementation details that I don't see how any specific technique could reliably produce faster code.

For example, consider the following two methods. They were adapted from a real-life example:

interface IIndexed { int this[int index] { get; set; } }
struct StructArray : IIndexed { 
    public int[] Array;
    public int this[int index] {
        get { return Array[index]; }
        set { Array[index] = value; }
    }
}

static int Generic<T>(int length, T a, T b) where T : IIndexed {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}
static int Specialized(int length, StructArray a, StructArray b) {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}

One is a generic version of the other. Replacing the generic type with StructArray would make the methods identical. Because StructArray is a value type, it gets its own compiled version of the generic method. Yet the actual running time is significantly longer than the specialized method's, but only for x86. For x64, the timings are pretty much identical. In other cases, I've observed differences for x64 as well.


This looks like a case of inlining gone bad. On an x86 core, the jitter has the ebx, edx, esi and edi register available for general purpose storage of local variables. The ecx register becomes available in a static method, it doesn't have to store this. The eax register often is needed for calculations. But these are 32-bit registers, for variables of type long it must use a pair of registers. Which are edx:eax for calculations and edi:ebx for storage.

Which is what stands out in the disassembly for the slow version, neither edi nor ebx are used.

When the jitter can't find enough registers to store local variables then it must generate code to load and store them from the stack frame. That slows down code, it prevents a processor optimization named "register renaming", an internal processor core optimization trick that uses multiple copies of a register and allows super-scalar execution. Which permits several instructions to run concurrently, even when they use the same register. Not having enough registers is a common problem on x86 cores, addressed in x64 which has 8 extra registers (r9 through r15).

The jitter will do its best to apply another code generation optimization, it will try to inline your Fibo() method. In other words, not make a call to the method but generate the code for the method inline in the Main() method. Pretty important optimization that, for one, makes properties of a C# class for free, giving them the perf of a field. It avoids the overhead of making the method call and setting up its stack frame, saves a couple of nanoseconds.

There are several rules that determine exactly when a method can be inlined. They are not exactly documented but have been mentioned in blog posts. One rule is that it won't happen when the method body is too large. That defeats the gain from inlining, it generates too much code that doesn't fit as well in the L1 instruction cache. Another hard rule that applies here is that a method won't be inlined when it contains a try/catch statement. The background behind that one is an implementation detail of exceptions, they piggy-back onto Windows' built-in support for SEH (Structure Exception Handling) which is stack-frame based.

One behavior of the register allocation algorithm in the jitter can be inferred from playing with this code. It appears to be aware of when the jitter is trying to inline a method. One rule it appears to use that only the edx:eax register pair can be used for inlined code that has local variables of type long. But not edi:ebx. No doubt because that would be too detrimental to the code generation for the calling method, both edi and ebx are important storage registers.

So you get the fast version because the jitter knows up front that the method body contains try/catch statements. It knows it can never be inlined so readily uses edi:ebx for storage for the long variable. You got the slow version because the jitter didn't know up front that inlining wouldn't work. It only found out after generating the code for the method body.

The flaw then is that it didn't go back and re-generate the code for the method. Which is understandable, given the time constraints it has to operate in.

This slow-down doesn't occur on x64 because for one it has 8 more registers. For another because it can store a long in just one register (like rax). And the slow-down doesn't occur when you use int instead of long because the jitter has a lot more flexibility in picking registers.


I'd have put this in as a comment as I'm really not certain that this is likely to be the case, but as I recall it doesn't a try/except statement involve a modification to the way the garbage disposal mechanism of the compiler works, in that it clears up object memory allocations in a recursive way off the stack. There may not be an object to be cleared up in this case or the for loop may constitute a closure that the garbage collection mechanism recognises sufficient to enforce a different collection method. Probably not, but I thought it worth a mention as I hadn't seen it discussed anywhere else.