Faster approach to checking for an all-zero buffer in C?

Solution 1:

On many architectures, comparing 1 byte takes the same amount of time as comparing 4 or 8, or sometimes even 16. Comparing 4 bytes at a time is normally easy (either int or long), and so is 8 (long or long long). 16 or more probably requires inline assembly or compiler intrinsics, e.g. to use a vector unit.

Also, branch mispredictions really hurt, so it may help to eliminate branches. For example, if the buffer is almost always empty, instead of testing each block against 0, bitwise-OR the blocks together and test the final result.
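A minimal sketch of the OR-and-test-once idea, working byte by byte for clarity. The function name is_zero_or is made up for this example. It has no early exit, so it always reads the whole buffer and only pays off when the buffer is usually all zero.

```c
#include <stddef.h>

/* Returns 1 if all `size` bytes of `buf` are zero, 0 otherwise.
   There is no early exit: every byte is OR-ed into one accumulator,
   and the accumulator is tested once at the end. */
int is_zero_or(const unsigned char *buf, size_t size)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < size; i++)
        acc |= buf[i];
    return acc == 0;
}
```

A middle ground would be to OR a fixed-size block this way and test once per block, which keeps most of the early-exit benefit for non-empty buffers.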


Expressing this is difficult in portable C: casting a char* to long* and reading through it violates strict aliasing (and may be misaligned). Fortunately, you can use memcpy to portably express an unaligned multi-byte load that can alias anything. Compilers will usually optimize it to the asm you want.

For example, this work-in-progress implementation (https://godbolt.org/z/3hXQe7) on the Godbolt compiler explorer shows that you can get a good inner loop (with some startup overhead) by loading two consecutive uint_fast32_t vars (often 64-bit) with memcpy and then checking tmp1 | tmp2. Many CPUs set flags according to the result of an OR, so this lets you check two words for the price of one.

Getting it to compile efficiently for targets without efficient unaligned loads requires some manual alignment in the startup code, and even then gcc may not inline the memcpy for loads where it can't prove alignment.
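Here is a rough sketch of the memcpy approach described above. It is not the code from the Godbolt link: the name is_zero_words is made up, it uses uint64_t rather than uint_fast32_t, and it does no manual alignment, so it assumes a target where unaligned loads are cheap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if all `size` bytes at `p` are zero, 0 otherwise. */
int is_zero_words(const void *p, size_t size)
{
    const unsigned char *buf = p;
    size_t i = 0;

    /* Main loop: load two words with memcpy (no aliasing or alignment
       problems) and do a single test on their OR. */
    for (; size - i >= 2 * sizeof(uint64_t); i += 2 * sizeof(uint64_t)) {
        uint64_t tmp1, tmp2;
        memcpy(&tmp1, buf + i, sizeof tmp1);
        memcpy(&tmp2, buf + i + sizeof tmp1, sizeof tmp2);
        if (tmp1 | tmp2)
            return 0;
    }

    /* Tail: the remaining 0..15 bytes, one at a time. */
    for (; i < size; i++)
        if (buf[i])
            return 0;

    return 1;
}
```

With optimization enabled, compilers commonly turn each fixed-size memcpy into a plain load, but that is worth confirming in the generated asm for your target.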

Solution 2:

One potential way, inspired by Kieveli's dismissed idea:

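#include <string.h>

/* Compares against a static all-zero block; only the first 999 bytes are checked. */
int is_empty(char *buf, size_t size)
{
    static const char zero[999] = { 0 };
    return !memcmp(zero, buf, size > 999 ? 999 : size);
}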

Note that, as written, this only checks the first 999 bytes, so it doesn't handle arbitrary sizes. You could do this:

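#include <stdlib.h>
#include <string.h>

int is_empty(char *buf, size_t size)
{
    char *zero = calloc(size, 1);   /* calloc takes a count and an element size */
    if (zero == NULL)
        return -1;                  /* allocation failed (or size was 0): treat as an error */
    int empty = !memcmp(zero, buf, size);
    free(zero);
    return empty;                   /* 1 if the buffer is all zero, 0 otherwise */
}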

But any dynamic memory allocation is going to be slower than what you have. The only reason the first solution might be faster is that it can use memcmp(), which is typically hand-optimized in assembly language by the library writers and is likely to beat a straightforward loop written in C.
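The size cap of the static-buffer version can also be avoided without allocating, by comparing against the zero block one chunk at a time. This is only a sketch of that variation, not part of the original answer; the name is_empty_chunked and the chunk size of 4096 are arbitrary choices made for the example.

```c
#include <stddef.h>
#include <string.h>

#define ZERO_CHUNK 4096  /* arbitrary chunk size chosen for this sketch */

/* Returns 1 if all `size` bytes of `buf` are zero, 0 otherwise. */
int is_empty_chunked(const char *buf, size_t size)
{
    static const char zero[ZERO_CHUNK] = { 0 };

    while (size > 0) {
        /* Compare at most one chunk per memcmp() call. */
        size_t n = size < ZERO_CHUNK ? size : ZERO_CHUNK;
        if (memcmp(zero, buf, n) != 0)
            return 0;
        buf += n;
        size -= n;
    }
    return 1;
}
```

Whether this beats a plain loop depends on the library's memcmp() and on the buffer sizes involved, so it needs measuring.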

EDIT: An optimization no one else has mentioned, based on earlier observations about the likelihood of the buffer being in state X: if a buffer isn't empty, is it more likely to be non-empty at the beginning or at the end? If it's more likely to have cruft at the end, you could start your check at the end and probably see a nice little performance boost.
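A minimal sketch of scanning from the end, assuming cruft really does tend to sit in the last bytes. The name is_empty_from_end is made up for this example.

```c
#include <stddef.h>

/* Returns 1 if all `size` bytes of `buf` are zero, 0 otherwise.
   Walks backwards so that trailing non-zero bytes are found first. */
int is_empty_from_end(const char *buf, size_t size)
{
    while (size > 0)
        if (buf[--size] != 0)
            return 0;
    return 1;
}
```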

EDIT 2: Thanks to Accipitridae in the comments:

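int is_empty(char *buf, size_t size)
{
    /* size == 0 is checked first so buf[0] and size - 1 are never evaluated for an empty buffer */
    return size == 0 || (buf[0] == 0 && !memcmp(buf, buf + 1, size - 1));
}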

This basically compares the buffer to itself shifted by one byte, with an initial check that the first element is zero. If every byte equals its successor and the first byte is zero, then every byte is zero, so any non-zero element will cause the check to fail. I don't know how this would perform compared to the other versions, but I do know that it will fail quickly (before we even loop) if the first element is nonzero. If you're more likely to have cruft at the end, change buf[0] to buf[size - 1] to get the same effect.

Solution 3:

The benchmarks given above (https://stackoverflow.com/a/1494499/2154139) are not accurate. They imply that func3 is much faster than the other options.

However, if you change the order of the tests, so that func3 comes before func2, you'd see func2 is much faster.

Be careful when running several benchmarks within a single execution: the side effects are large, especially when reusing the same variables. It is better to run each test in isolation!

For example, changing it to:

int main(){
  MEASURE( func3 );
  MEASURE( func3 );
  MEASURE( func3 );
  MEASURE( func3 );
  MEASURE( func3 );
}

gives me:

func3: zero          14243
func3: zero           1142
func3: zero            885
func3: zero            848
func3: zero            870

This was really bugging me as I couldn't see how func3 could perform so much faster than func2.
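The first measurement above is far slower than the later ones, which suggests one-time warm-up costs end up in whichever test runs first. A rough sketch of a harness that discards warm-up calls before timing is below. It is not the MEASURE macro from the linked answer: best_ticks and all_zero are hypothetical names, all_zero is only a stand-in for the function under test, and clock() is a coarse timer.

```c
#include <stddef.h>
#include <time.h>

typedef int (*check_fn)(const char *buf, size_t size);

/* Stand-in for the function under test (not the original func3). */
int all_zero(const char *buf, size_t size)
{
    for (size_t i = 0; i < size; i++)
        if (buf[i])
            return 0;
    return 1;
}

/* Calls fn `warmup` times untimed, then returns the fastest of `runs`
   timed calls in clock() ticks, or -1 if runs is 0. */
long best_ticks(check_fn fn, const char *buf, size_t size,
                int warmup, int runs)
{
    volatile int sink = 0;  /* keeps the results observably used */
    long best = -1;

    for (int i = 0; i < warmup; i++)
        sink += fn(buf, size);

    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        sink += fn(buf, size);
        long ticks = (long)(clock() - start);
        if (best < 0 || ticks < best)
            best = ticks;
    }
    (void)sink;
    return best;
}
```

Running each candidate in its own process, for example by selecting the function from a command-line argument, isolates the tests further.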

(Apologies for posting this as an answer rather than a comment; I didn't have enough reputation.)