Why is memory allocation on the heap MUCH slower than on the stack?

Solution 1:

Because the heap is a far more complicated data structure than the stack.

On many architectures, allocating memory on the stack is just a matter of adjusting the stack pointer, i.e. it's a single instruction. Allocating memory on the heap involves searching for a big enough block, splitting it, and managing the bookkeeping that allows blocks to be released with free() in a different order than they were allocated.

Memory allocated on the stack is guaranteed to be deallocated when the scope (typically the function) exits, and it's not possible to deallocate just some of it.

Solution 2:

In your edit where you restate unwind's answer, you mention the "heap data structure". Be careful: the data structure known as a heap has no relationship to dynamic memory allocation. To be very clear, I'll use the more language-lawyer terminology of free store.

As has already been pointed out, stack allocation requires adjusting a pointer that has a dedicated register on most architectures, and deallocation requires the same amount of work. Stack allocations are also scoped to a particular function, which makes them much better candidates for compiler optimizations such as precomputing the total space needed on the stack and doing a single adjustment for the entire stack frame.

Likewise, the stack has better guaranteed data locality: the top of the stack is almost always inside a cache line, and, as mentioned, the stack pointer is typically held in a register. Optimizing compilers on some architectures can even eliminate stack allocations altogether by reusing space from previous stack frames for arguments passed to functions in deeper frames. Similarly, stack-allocated variables can often be promoted to registers, avoiding the allocation entirely.

In contrast, the free store is much more complex. I'm not even going to begin to cover garbage collection systems, as that's a whole different topic, and this question was asked about the C language. Allocations and deallocations from a free store typically involve several different data structures, such as a free list or block pool. These data structures and their bookkeeping consume memory of their own, so that space is wasted. Furthermore, the bookkeeping records are often intermixed with the allocations, hurting the data locality of other allocations. Allocations from the free store may also involve asking the underlying operating system for more process memory, typically through some form of page or slab allocator.

For a simple comparison, using jemalloc-2.2.5 and line counts from sloccount as a reference: the jemalloc implementation contains over 8,800 lines of C source code plus over 700 lines of test code. That should give you a good idea of the difference in complexity between free store allocation and stack allocation: thousands of lines of C versus a single instruction.

Additionally, since free store allocations are not limited to a single lexical scope, the lifetime of every allocation must be tracked. These allocations may also be passed across threads, so thread synchronization enters the problem space. Another big problem for free store allocation is fragmentation, which causes several problems:

  • Fragmentation hurts data locality.
  • Fragmentation wastes memory.
  • Fragmentation makes the job of finding free space for large allocations harder.

On modern systems, stacks are often small relative to the free store, so ultimately the free store is managing more space and thus tackling a harder problem. Because of the limits on stack size, the free store is also typically used for larger allocations, and this need to handle both very large and very small allocations makes its job harder still. Typically stack allocations are small, on the order of a few kilobytes or less, and the total size of the stack is only a few megabytes, whereas the free store is generally given the rest of the process's address space. On modern machines that can be several hundred gigabytes of virtual address space, and it's not uncommon for free store allocations to range from a few bytes, like a short string of characters, to megabytes or even gigabytes of arbitrary data. This means that free store allocators also have to deal with the underlying operating system's virtual memory management, whereas stack allocation is essentially built into the computer hardware.

If you really want to learn about free store allocation, I would highly recommend reading some of the many papers and articles published about various malloc implementations, or even reading the code. Here are a few links to get you started:

  • dlmalloc - Doug Lea's malloc, a historical reference implementation and the basis for the allocator in the GNU C library
  • phkmalloc - FreeBSD's malloc implementation, written by Poul-Henning Kamp, author of the Varnish web cache
  • tcmalloc - Thread-Caching Malloc, implemented by developers at Google
  • jemalloc - Jason Evans's malloc implementation for FreeBSD (the successor of phkmalloc)

Here are some additional links describing the tcmalloc implementation:

  • http://jamesgolick.com/2013/5/15/memory-allocators-101.html
  • http://jamesgolick.com/2013/5/19/how-tcmalloc-works.html