Reset C int array to zero: the fastest way?

Assuming that we have a T myarray[100] with T = int, unsigned int, long long int or unsigned long long int, what is the fastest way to reset all its content to zero (not only for initialization but to reset the content several times in my program)? Maybe with memset?

Same question for a dynamic array like T *myarray = new T[100].

Solution 1:

memset (from <string.h>) is probably the fastest standard way, since it's usually a routine written directly in assembly and optimized by hand.

memset(myarray, 0, sizeof(myarray)); // for automatically-allocated arrays
memset(myarray, 0, N*sizeof(*myarray)); // for heap-allocated arrays, where N is the number of elements
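To make the two calls above concrete, here is a minimal sketch. The function names reset_fixed, reset_dynamic and all_zero are made up for illustration and are not part of any library. It relies on the fact that all-bits-zero represents 0 for the integer types in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Zero a fixed-size array: sizeof(arr) is the full size in bytes because
// the array type (including its length) is still visible here.
void reset_fixed(int (&arr)[100]) {
    std::memset(arr, 0, sizeof(arr));
}

// Zero a heap-allocated array: only a pointer is available, so the
// element count n has to be passed in by the caller.
void reset_dynamic(long long *arr, std::size_t n) {
    std::memset(arr, 0, n * sizeof(*arr));
}

// Hypothetical checking helper: true if every one of the n elements is zero.
template <typename T>
bool all_zero(const T *arr, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (arr[i] != 0) return false;
    return true;
}
```

Calling reset_fixed(myarray) on an int myarray[100], or reset_dynamic(p, 100) on a pointer obtained from new long long[100], should leave every element equal to zero.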

By the way, in C++ the idiomatic way would be to use std::fill (from <algorithm>):

std::fill(myarray, myarray+N, 0);

which may be optimized automatically into a memset; I'm quite sure that it will work as fast as memset for ints, while it may perform slightly worse for smaller types if the optimizer isn't smart enough. Still, when in doubt, profile.
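As a rough illustration of the std::fill approach, the sketch below wraps it in two overloads of a helper I am calling reset (a hypothetical name, not a standard function). Whether the compiler turns either one into a memset call depends on the compiler and flags, so treat it as a starting point for profiling rather than a guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical generic helper: reset n elements starting at first to T().
// For the integer types in the question, T() is 0.
template <typename T>
void reset(T *first, std::size_t n) {
    std::fill(first, first + n, T());
}

// Overload for built-in arrays: N is deduced from the array type,
// so the caller does not need to pass the size.
template <typename T, std::size_t N>
void reset(T (&arr)[N]) {
    std::fill(arr, arr + N, T());
}
```

With these, reset(myarray) covers the T myarray[100] case and reset(p, 100) covers the new T[100] case.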

Solution 2:

This question, although rather old, needs some benchmarks, since it asks not for the most idiomatic way, or the way that takes the fewest lines, but for the fastest way. It is silly to answer that question without some actual testing. So I compared four solutions: memset, std::fill, ZERO from AnT's answer, and a solution I wrote using AVX intrinsics.

Note that this solution is not generic: it only works on 32-bit or 64-bit data. Please comment if this code is doing something incorrect.

#include <immintrin.h>
#include <stddef.h>
/* Zero n elements of a; elements must be 4 or 8 bytes wide. Requires AVX. */
#define intrin_ZERO(a,n){\
size_t x = 0;\
const size_t inc = 32 / sizeof(*(a));/*elements per 256-bit register*/\
for (;x + inc <= (size_t)(n);x+=inc)/*full 32-byte stores only*/\
    _mm256_storeu_ps((float *)((a)+x),_mm256_setzero_ps());\
if(8 == sizeof(*(a))){/*0 to 3 elements left*/\
    switch((size_t)(n)-x){\
    case 3:\
        (a)[x] = 0;x++;\
    case 2:/*one 16-byte store covers two 8-byte elements*/\
        _mm_storeu_ps((float *)((a)+x),_mm_setzero_ps());break;\
    case 1:\
        (a)[x] = 0;\
        break;\
    case 0:\
        break;\
    };\
}\
else if(4 == sizeof(*(a))){/*0 to 7 elements left*/\
switch((size_t)(n)-x){\
    case 7:\
        (a)[x] = 0;x++;\
    case 6:\
        (a)[x] = 0;x++;\
    case 5:\
        (a)[x] = 0;x++;\
    case 4:/*one 16-byte store covers four 4-byte elements*/\
        _mm_storeu_ps((float *)((a)+x),_mm_setzero_ps());break;\
    case 3:\
        (a)[x] = 0;x++;\
    case 2:\
        (a)[x] = 0;x++;\
    case 1:\
        (a)[x] = 0;\
        break;\
    case 0:\
        break;\
};\
}\
}

I will not claim that this is the fastest method, since I am not a low-level optimization expert. Rather, it is an example of a correct architecture-dependent implementation that is faster than memset.

Now, onto the results. I measured performance for arrays of 100 int and 100 long long elements, both statically and dynamically allocated. With the exception of msvc, which did a dead code elimination on static arrays, the results were extremely comparable, so I will show only dynamic array performance. Timings are in ms for 1 million iterations, measured with time.h's low-precision clock function.
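For reference, a timing loop of the kind described could look roughly like the sketch below. This is not the harness used for the numbers in this answer; the function name time_memset_ms and the volatile sink are my own assumptions, and the sink is only a best-effort guard against dead code elimination, so it is still worth inspecting the generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <ctime>

// Hypothetical harness: run `iterations` memset calls on n ints and
// return the elapsed CPU time in milliseconds as reported by clock().
double time_memset_ms(int *arr, std::size_t n, long iterations) {
    volatile int sink = 0;  // best-effort guard against dead code elimination
    std::clock_t start = std::clock();
    for (long i = 0; i < iterations; ++i) {
        arr[i % n] = 1;                         // dirty one element so each pass clears a non-zero value
        std::memset(arr, 0, n * sizeof(*arr));  // the operation being timed
        sink = sink + arr[i % n];               // read a result back on every iteration
    }
    std::clock_t end = std::clock();
    return 1000.0 * static_cast<double>(end - start) / CLOCKS_PER_SEC;
}
```

Something like time_memset_ms(p, 100, 1000000) would correspond to the 1 million iterations mentioned above. The resolution of clock() is coarse, so very short runs may report 0.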

clang 3.8 (using the clang-cl frontend, optimization flags: /OX /arch:AVX /Oi /Ot):

int:
memset:      99
fill:        97
ZERO:        98
intrin_ZERO: 90

long long:
memset:      285
fill:        286
ZERO:        285
intrin_ZERO: 188

gcc 5.1.0 (optimization flags: -O3 -march=native -mtune=native -mavx):

int:
memset:      268
fill:        268
ZERO:        268
intrin_ZERO: 91

long long:
memset:      402
fill:        399
ZERO:        400
intrin_ZERO: 185

msvc 2015 (optimization flags: /OX /arch:AVX /Oi /Ot):

int:
memset:      196
fill:        613
ZERO:        221
intrin_ZERO: 95

long long:
memset:      273
fill:        559
ZERO:        376
intrin_ZERO: 188

There is a lot of interesting stuff going on here: llvm killing gcc, and MSVC's typical spotty optimizations (it does an impressive dead code elimination on static arrays and then has awful performance for fill). Although my implementation is significantly faster, this may only be because it recognizes that bit clearing has much less overhead than any other setting operation.

Clang's implementation merits a closer look, as it is significantly faster. Some additional testing shows that its memset is in fact specialized for zero: non-zero memsets for a 400-byte array are much slower (~220 ms) and are comparable to gcc's. However, with an 800-byte array, non-zero memsetting makes no speed difference, which is probably why in that case its memset performs worse than my implementation: the specialization only applies to small arrays, and the cutoff is right around 800 bytes. Also note that gcc's 'fill' and 'ZERO' are not optimized to memset (judging by the generated code); gcc is simply generating code with identical performance characteristics.

Conclusion: memset is not really as well optimized for this task as people would pretend it is (otherwise gcc's, msvc's and llvm's memset would have the same performance). If performance matters, then memset should not be the final solution, especially for these awkward medium-sized arrays, because it is not specialized for bit clearing, and it is not hand-optimized any better than the compiler can do on its own.

Solution 3:

From memset():

memset(myarray, 0, sizeof(myarray));

You can use sizeof(myarray) if the size of myarray is known at compile-time. Otherwise, if you are using a dynamically-sized array, such as obtained via malloc or new, you will need to keep track of the length.
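One way to keep track of the length is to store it next to the pointer. The sketch below uses a made-up struct IntBuffer and helper names purely for illustration. It also shows why sizeof on the pointer is not a substitute: it gives the size of the pointer, not of the array it points to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical wrapper that keeps a heap array together with its length,
// since sizeof on the pointer alone cannot recover the element count.
struct IntBuffer {
    int *data;
    std::size_t length;
};

// Zero the whole buffer using the stored length.
void reset(IntBuffer &buf) {
    std::memset(buf.data, 0, buf.length * sizeof(*buf.data));
}

// Returns what sizeof yields for a pointer: the size of the pointer itself,
// which is the byte count a mistaken memset(p, 0, sizeof(p)) would clear.
std::size_t bytes_cleared_by_sizeof_pointer(int *p) {
    return sizeof(p);
}
```

For a buffer created as IntBuffer buf = { new int[100], 100 }, reset(buf) clears all 100 elements, whereas a memset sized with sizeof(buf.data) would only clear the first few bytes.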
