What the heque is going on with the memory overhead of std::deque?

I am working on an external sorting algorithm that uses std::queue and must carefully constrain its memory usage. I have noticed that during the merge phase (which uses several std::queues of fixed length), my memory usage increases to about 2.5X what I expected. Since std::queue by default uses std::deque as its underlying container, I ran some tests on std::deque to determine its memory overhead. Here are the results, running on VC++ 9, in release mode, with a 64-bit process:

When adding 100,000,000 chars to a std::deque, the memory usage grows to 252,216K. Note that 100M chars (1 byte) should occupy 97,656K, so this is an overhead of 154,560K.

I repeated the test with doubles (8 bytes) and saw memory grow to 1,976,676K, while 100M doubles should occupy 781,250K, for an overhead of 1,195,426K!!
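For reference, the expected figures are straight arithmetic (1K = 1,024 bytes):

chars:   100,000,000 * 1 byte  = 100,000,000 bytes / 1,024 ~  97,656K; measured 252,216K   -> 154,560K overhead
doubles: 100,000,000 * 8 bytes = 800,000,000 bytes / 1,024 ~ 781,250K; measured 1,976,676K -> 1,195,426K overhead

Both cases land at roughly 2.5x the expected footprint.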

Now, I understand that std::deque is normally implemented as a collection of fixed-size "chunks" strung together by pointers. If that's true, why is the overhead proportional to the element size (the pointers themselves should be a fixed 8 bytes, after all)? And why is it so danged huge?

Can anybody shed some light on why std::deque uses so much danged memory? I'm thinking I should switch the underlying container of my std::queues to std::vector, since there is no overhead (provided the appropriate size is reserved). It seems to me that the benefits of std::deque are largely negated by its huge overhead (with the resulting cache misses, page faults, etc.), and that the cost of copying std::vector elements may well be lower, given that the overall memory usage is so much lower. Is this just a bad implementation of std::deque by Microsoft?


Solution 1:

Look at the code for _DEQUESIZ (number of elements per block):

#define _DEQUESIZ   (sizeof (_Ty) <= 1 ? 16 \
    : sizeof (_Ty) <= 2 ? 8 \
    : sizeof (_Ty) <= 4 ? 4 \
    : sizeof (_Ty) <= 8 ? 2 : 1)    /* elements per block (a power of 2) */

The number of elements per block gets smaller as the element gets larger. Only for elements larger than 8 bytes do you get the behavior you expected: overhead that shrinks, as a percentage, as the element size grows.
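To see why that makes the overhead grow with element size: every block carries a roughly fixed cost no matter how little payload it holds - a slot in the deque's internal map of block pointers plus the heap manager's per-allocation bookkeeping. A back-of-the-envelope sketch (the 24 bytes per block below is an assumption about a 64-bit CRT heap, not a number from the header):

#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t n = 100000000;            // elements pushed in the test
    const std::size_t assumed_block_cost = 24;  // assumed fixed bytes per block (map slot + heap bookkeeping)

    // For any element of 8 bytes or less, a block holds only 16 bytes of payload,
    // so the block count - and with it the fixed cost - grows with sizeof(_Ty).
    const std::size_t char_blocks   = n / 16;   // 16 chars per block
    const std::size_t double_blocks = n / 2;    // 2 doubles per block

    std::cout << "char:   " << char_blocks   << " blocks, ~"
              << char_blocks   * assumed_block_cost / (1024 * 1024) << " MB of fixed cost\n";
    std::cout << "double: " << double_blocks << " blocks, ~"
              << double_blocks * assumed_block_cost / (1024 * 1024) << " MB of fixed cost\n";
}

With that assumption the fixed cost alone works out to roughly 143 MB for the chars and 1,144 MB for the doubles - in the same ballpark as the overheads of 154,560K and 1,195,426K measured above.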

Solution 2:

Is it possible that you are running Debug binaries? 252MB for 100M chars does seem like a lot...

You can check where the memory is actually going by using umdh to take a snapshot before and after, then comparing the two - that might shed some light on why it's larger than you expected.

EDIT: FYI - When I run this outside the debugger on VS2010 I get 181MB with chars.

#include <deque>

std::deque<char> mydequeue;
for (size_t i = 0; i < 100 * 1024 * 1024; ++i)
{
  mydequeue.push_back(char(i));   // 100 * 1024 * 1024 = 104,857,600 one-byte elements
}
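For what it's worth, one way to have the test report its own footprint is to query the working set via GetProcessMemoryInfo right after the loop - a rough sketch:

#include <windows.h>
#include <psapi.h>     // link against psapi.lib
#include <iostream>

// Prints the current working set in MB; call it right after the fill loop above.
void report_working_set()
{
    PROCESS_MEMORY_COUNTERS pmc = { 0 };
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        std::cout << "working set: " << pmc.WorkingSetSize / (1024 * 1024) << " MB\n";
}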

EDIT: Supporting the other answer from @Dialecticus, this gives me the same footprint as double:

struct twoInt64s
{
public:
    twoInt64s(__int64 _a, __int64 _b) : a(_a), b(_b) {}

    __int64 a;
    __int64 b;
};
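That fits the macro in the other answer: sizeof(twoInt64s) is 16, so it drops into the final 1-element-per-block case, and each block again carries exactly 16 bytes of payload - the same as the 2-doubles-per-block case. A quick compile-time check (static_assert works in VS2010):

static_assert(sizeof(twoInt64s) == 16,
              "expected to land in the 1-element-per-block case of _DEQUESIZ");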

EDIT: With _DEQUESIZ modified as shown below (128 chars per block), 100M chars now takes 113M of memory.

My conclusion is that the overhead you saw is down to per-block management: each block carries only 16 chars of data, but also pays for the deque's own control info (its map of block pointers) plus the heap manager's control info for every allocation.

#define _DEQUESIZ   (sizeof (value_type) <= 1 ? 128 \
    : sizeof (value_type) <= 2 ? 8 \
    : sizeof (value_type) <= 4 ? 4 \
    : sizeof (value_type) <= 8 ? 2 \
    : 1)    /* elements per block (a power of 2) */
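Rough arithmetic backs this up. Assuming on the order of 24 bytes of fixed cost per block (an 8-byte slot in the deque's map of block pointers plus heap-manager bookkeeping - an estimate, not a measured figure):

 16 chars/block : 100,000,000 / 16  = 6,250,000 blocks
                  ~95 MB payload + 6,250,000 * 24 bytes ~ 143 MB fixed  ->  ~238 MB total
                  (measured: 252,216K ~ 246 MB)

128 chars/block : 100,000,000 / 128 = 781,250 blocks
                  ~95 MB payload + 781,250 * 24 bytes ~ 18 MB fixed     ->  ~113 MB total
                  (measured: 113 MB)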

Moral - if you really want to optimize this for your special purpose, be prepared to play with <deque>. Its behaviour depends critically on the size of your elements, and beyond that on the expected usage pattern.

EDIT: If you know the sizes of your queues up front, you may be able to drop in boost::circular_buffer as the underlying container for your std::queues. I bet it would perform more like you want (and expected).
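For example, something along these lines (the capacity is made up - use whatever bound you know for your queues; note that a full circular_buffer silently overwrites its oldest element on push_back, so it must be sized for the worst case):

#include <cstddef>
#include <queue>
#include <boost/circular_buffer.hpp>

int main()
{
    const std::size_t run_length = 4096;    // made-up bound; use your known queue size

    // std::queue only requires front(), back(), push_back(), pop_front(),
    // empty() and size() from its container, all of which circular_buffer
    // provides. The storage is a single contiguous allocation made up front.
    std::queue<double, boost::circular_buffer<double> >
        q(boost::circular_buffer<double>(run_length));

    for (std::size_t i = 0; i < run_length; ++i)
        q.push(static_cast<double>(i));     // no per-element heap allocation

    while (!q.empty())
        q.pop();

    return 0;
}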