Stackoverflow doing boxing in C#

I have these two chunks of code in C#:

First

class Program
{
    static Stack<int> S = new Stack<int>();

    static int Foo(int n) {
        if (n == 0)
            return 0;
        S.Push(0);
        S.Push(1);
        ...
        S.Push(999);
        return Foo( n-1 );
    }
}

Second

class Program
{
    static Stack S = new Stack();

    static int Foo(int n) {
        if (n == 0)
            return 0;
        S.Push(0);
        S.Push(1);
        ...
        S.Push(999);
        return Foo( n-1 );
    }
}

They both do the same:

  1. Create a Stack (generic in <int> for the first example and a stack of object for the second one).

  2. Declare a method that calls itself recursively n times (n >= 0) and in every step push 1000 integers inside the created stack.

When I run the first example with Foo(30000) no exception occurs, however the second example crashes with Foo(1000), just n = 1000.

When I saw the CIL generated for both cases the only difference was the boxing part for every push:

First

IL_0030:  ldsfld     class [System]System.Collections.Generic.Stack`1<int32> Test.Program::S
IL_0035:  ldc.i4     0x3e7
IL_003a:  callvirt   instance void class [System]System.Collections.Generic.Stack`1<int32>::Push(!0)
IL_003f:  nop

Second

IL_003a:  ldsfld     class [mscorlib]System.Collections.Stack Test.Program::S
IL_003f:  ldc.i4     0x3e7
IL_0044:  box        [mscorlib]System.Int32
IL_0049:  callvirt   instance void [mscorlib]System.Collections.Stack::Push(object)
IL_004e:  nop

My question is: Why, if there is not significant overload of CIL's stack for the second example, does it crash "faster" than the first one?


Why, if there is not significant overload of CIL's stack for second example, does it crash "faster" than the first one?

Note that the number of CIL instructions does not accurately represent the amount of work or memory that will be used. A single instruction can be very low impact, or very high impact, so counting CIL instructions is not an accurate way to measure "work".

Also realize that the CIL is not what gets executed. The JIT compiles the CIL to actual machine instructions, with an optimization phase, so the CIL may be very different than the actual executed instructions.

In the second case, since you're using a non-generic collection, every Push call requires the integer to be boxed, as you determined in CIL.

Boxing an integer effectively creates an object which "wraps" the Int32 for you. Instead of just loading a 32-bit integer onto the stack, it now has to load a 32-bit integer onto the stack, then box it, which effectively also loads an object reference onto the stack.

If you inspect this in the Disassembly window, you can see the difference between the generic and non-generic version is dramatic, and much more significant than the generated CIL would suggest.

The generic version effectively compiles to as a series of calls like so:

0000022c  nop 
            S.Push(25);
0000022d  mov         ecx,dword ptr ds:[03834978h] 
00000233  mov         edx,19h 
00000238  cmp         dword ptr [ecx],ecx 
0000023a  call        71618DD0 
0000023f  nop 
            S.Push(26);
00000240  mov         ecx,dword ptr ds:[03834978h] 
00000246  mov         edx,1Ah 
0000024b  cmp         dword ptr [ecx],ecx 
0000024d  call        71618DD0 
00000252  nop 
            S.Push(27);

The non-generic, on the other hand, has to create the boxed objects, and instead compiles to:

00000645  nop 
            S.Push(25);
00000646  mov         ecx,7326560Ch 
0000064b  call        FAAC20B0 
00000650  mov         dword ptr [ebp-48h],eax 
00000653  mov         eax,dword ptr ds:[03AF4978h] 
00000658  mov         dword ptr [ebp+FFFFFEE8h],eax 
0000065e  mov         eax,dword ptr [ebp-48h] 
00000661  mov         dword ptr [eax+4],19h 
00000668  mov         eax,dword ptr [ebp-48h] 
0000066b  mov         dword ptr [ebp+FFFFFEE4h],eax 
00000671  mov         ecx,dword ptr [ebp+FFFFFEE8h] 
00000677  mov         edx,dword ptr [ebp+FFFFFEE4h] 
0000067d  mov         eax,dword ptr [ecx] 
0000067f  mov         eax,dword ptr [eax+2Ch] 
00000682  call        dword ptr [eax+18h] 
00000685  nop 
            S.Push(26);
00000686  mov         ecx,7326560Ch 
0000068b  call        FAAC20B0 
00000690  mov         dword ptr [ebp-48h],eax 
00000693  mov         eax,dword ptr ds:[03AF4978h] 
00000698  mov         dword ptr [ebp+FFFFFEE0h],eax 
0000069e  mov         eax,dword ptr [ebp-48h] 
000006a1  mov         dword ptr [eax+4],1Ah 
000006a8  mov         eax,dword ptr [ebp-48h] 
000006ab  mov         dword ptr [ebp+FFFFFEDCh],eax 
000006b1  mov         ecx,dword ptr [ebp+FFFFFEE0h] 
000006b7  mov         edx,dword ptr [ebp+FFFFFEDCh] 
000006bd  mov         eax,dword ptr [ecx] 
000006bf  mov         eax,dword ptr [eax+2Ch] 
000006c2  call        dword ptr [eax+18h] 
000006c5  nop 

Here you can see the significance of boxing.

In your case, boxing the integer causes the boxed object references to get loaded onto the stack. On my system, this is causing a stackoverflow on any calls larger than Foo(127) (in 32 bit), which suggests that the integers and boxed object references (4 bytes each) are all being kept on the stack, as 127*1000*8==1016000, which is dangerously close to the default 1 MB thread stack size for .NET applications.

When using the generic version, since there is no boxed object, the integers aren't having to all be stored on the stack, and the same register is being reused. This allows you to recurse significantly more (>40000 on my system) before you use up the stack.

Note that this will be CLR version and platform dependent, as there is a different JIT on x86/x64, as well.