What the heque is going on with the memory overhead of std::deque?
I am working on an external sorting algorithm that uses std::queue and must carefully constrain its memory usage. I have noticed that during the merge phase (which uses several std::queues of fixed length), my memory usage increases to about 2.5X what I expected. Since std::queue by default uses std::deque as its underlying container, I ran some tests on std::deque to determine its memory overhead. Here are the results, running on VC++ 9, in release mode, with a 64-bit process:

When adding 100,000,000 chars to a std::deque, the memory usage grows to 252,216K. Note that 100M chars (1 byte each) should occupy 97,656K, so this is an overhead of 154,560K.

I repeated the test with doubles (8 bytes) and saw memory grow to 1,976,676K, while 100M doubles should occupy 781,250K, for an overhead of 1,195,426K!!
Now I understand that std::deque is normally implemented as a linked list of "chunks." If this is true, then why is the overhead proportional to the element size (since the pointer size should be fixed at 8 bytes)? And why is it so danged huge?

Can anybody shed some light on why std::deque uses so much danged memory? I'm thinking I should switch the underlying container of my std::queues to std::vector, as there is no overhead (provided the appropriate size is reserve()d up front). I'm thinking the benefits of std::deque are largely negated by the fact that it has such a huge overhead (resulting in cache misses, page faults, etc.), and that the cost of copying std::vector elements may be less, given that the overall memory usage is so much lower. Is this just a bad implementation of std::deque by Microsoft?
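(For reference, the underlying container is just std::queue's second template parameter. A minimal sketch of the swap mechanism, using std::list purely as an illustration, since the adapter needs front(), back(), push_back() and pop_front() from its container, and std::vector has no pop_front():)

#include <deque>
#include <list>
#include <queue>

std::queue<double> q_default;                   // same as std::queue<double, std::deque<double> >
std::queue<double, std::list<double> > q_list;  // any container providing front/back/push_back/pop_front works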
Solution 1:
Look at the code for _DEQUESIZ (number of elements per block):
#define _DEQUESIZ (sizeof (_Ty) <= 1 ? 16 \
: sizeof (_Ty) <= 2 ? 8 \
: sizeof (_Ty) <= 4 ? 4 \
: sizeof (_Ty) <= 8 ? 2 : 1) /* elements per block (a power of 2) */
The block gets smaller as the element gets larger. Only for elements larger than 8 bytes do you get the expected behavior: overhead that shrinks, percentage-wise, as the element size grows.
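To see why the overhead tracks the element size, here is a back-of-the-envelope estimate (a sketch only: the 8-byte pointer per block in the deque's map is real bookkeeping, but the 16 bytes of per-allocation heap cost is an assumption that just happens to land near the figures in the question):

#include <iostream>

// Rough footprint: payload plus (map pointer + assumed heap cost) per block.
long long estimate_kb(long long n, long long elem_size, long long per_block)
{
    const long long blocks    = n / per_block;
    const long long map_ptr   = 8;   // one pointer per block in the deque's map
    const long long heap_cost = 16;  // assumed per-allocation overhead (a guess)
    return (n * elem_size + blocks * (map_ptr + heap_cost)) / 1024;
}

int main()
{
    // 16 chars per block vs. only 2 doubles per block, per _DEQUESIZ above
    std::cout << "char:   ~" << estimate_kb(100000000, 1, 16) << " KB\n";
    std::cout << "double: ~" << estimate_kb(100000000, 8, 2)  << " KB\n";
}

This prints roughly 244,000 KB for char and 1,953,000 KB for double, in the same ballpark as the 252,216K and 1,976,676K measured in the question: with only 2 doubles per block, you pay the per-block cost 50 million times.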
Solution 2:
Is it possible that you are running Debug binaries? 252MB for 100M chars does seem like a lot...
You can check where that memory is going by using umdh to take a snapshot before and after, then comparing the two - that might shed some light on why it's larger than you expected.
EDIT: FYI - when I run this outside the debugger on VS2010, I get 181MB with chars:
#include <deque>
int main()
{
    std::deque<char> mydequeue;
    for (size_t i = 0; i < 100 * 1024 * 1024; ++i)
    {
        mydequeue.push_back(char(i));
    }
}
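(The number above is the process footprint as reported externally; here is a sketch of querying the peak working set via the Win32 psapi API, in case reading it from inside the program is more convenient than watching Task Manager:)

#include <windows.h>
#include <psapi.h>
#include <iostream>

#pragma comment(lib, "psapi.lib")  // MSVC-specific way to pull in the psapi import library

void report_peak_working_set()
{
    PROCESS_MEMORY_COUNTERS pmc = { 0 };
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        std::cout << "peak working set: " << pmc.PeakWorkingSetSize / 1024 << " KB\n";
}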
EDIT: Supporting the other answer from @Dialecticus, this gives me the same footprint as double:
struct twoInt64s
{
public:
twoInt64s(__int64 _a, __int64 _b) : a(_a), b(_b) {}
__int64 a;
__int64 b;
};
EDIT: With _DEQUESIZ modified as shown below (128 chars per block), 100M chars now takes 113M of memory. My conclusion is that the remaining overhead you saw comes from the management structures for the deque blocks (which, with the default settings, hold only 16 chars of data each), plus the deque's own control info, plus further control info from the heap manager.
#define _DEQUESIZ (sizeof (value_type) <= 1 ? 128 \
: sizeof (value_type) <= 2 ? 8 \
: sizeof (value_type) <= 4 ? 4 \
: sizeof (value_type) <= 8 ? 2 \
: 1) /* elements per block (a power of 2) */
Moral - if you really want to optimize this for your special purpose, be prepared to play with <deque>. Its behaviour depends critically on the size of your elements, and beyond that on the expected usage pattern.
EDIT: Depending on your knowledge of queue sizes, you might be able to drop in boost::circular_buffer as a replacement for the std::queue container. I bet this would perform more like you want (and expected).
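Something along these lines should work, since boost::circular_buffer provides the front()/back()/push_back()/pop_front() operations that std::queue requires of its container (a sketch: make_queue, the element type, and the capacity are all placeholders, and note that push_back() on a full circular_buffer overwrites the oldest element rather than growing, so the code filling the queue must stay within the chosen capacity):

#include <cstddef>
#include <queue>
#include <boost/circular_buffer.hpp>

typedef std::queue<double, boost::circular_buffer<double> > fixed_queue;

// Build a queue whose storage is one contiguous, fixed-size allocation.
fixed_queue make_queue(std::size_t capacity)
{
    return fixed_queue(boost::circular_buffer<double>(capacity));
}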