Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs
I was looking for the fastest way to popcount
large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned
to uint64_t
made the performance drop by 50% on my PC.
The Benchmark
#include <iostream>
#include <chrono>
#include <x86intrin.h>
int main(int argc, char* argv[]) {
using namespace std;
if (argc != 2) {
cerr << "usage: array_size in MB" << endl;
return -1;
}
uint64_t size = atol(argv[1])<<20;
uint64_t* buffer = new uint64_t[size/8];
char* charbuffer = reinterpret_cast<char*>(buffer);
for (unsigned i=0; i<size; ++i)
charbuffer[i] = rand()%256;
uint64_t count,duration;
chrono::time_point<chrono::system_clock> startP,endP;
{
startP = chrono::system_clock::now();
count = 0;
for( unsigned k = 0; k < 10000; k++){
// Tight unrolled loop with unsigned
for (unsigned i=0; i<size/8; i+=4) {
count += _mm_popcnt_u64(buffer[i]);
count += _mm_popcnt_u64(buffer[i+1]);
count += _mm_popcnt_u64(buffer[i+2]);
count += _mm_popcnt_u64(buffer[i+3]);
}
}
endP = chrono::system_clock::now();
duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
startP = chrono::system_clock::now();
count=0;
for( unsigned k = 0; k < 10000; k++){
// Tight unrolled loop with uint64_t
for (uint64_t i=0;i<size/8;i+=4) {
count += _mm_popcnt_u64(buffer[i]);
count += _mm_popcnt_u64(buffer[i+1]);
count += _mm_popcnt_u64(buffer[i+2]);
count += _mm_popcnt_u64(buffer[i+3]);
}
}
endP = chrono::system_clock::now();
duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
free(charbuffer);
}
As you see, we create a buffer of random data, with the size being x
megabytes where x
is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount
intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times. We measure the times for the popcount. In the upper case, the inner loop variable is unsigned
, in the lower case, the inner loop variable is uint64_t
. I thought that this should make no difference, but the opposite is the case.
The (absolutely crazy) results
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):
g++ -O3 -march=native -std=c++11 test.cpp -o test
Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1
(so 1 MB random data):
- unsigned 41959360000 0.401554 sec 26.113 GB/s
- uint64_t 41959360000 0.759822 sec 13.8003 GB/s
As you see, the throughput of the uint64_t
version is only half the one of the unsigned
version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++
(Ubuntu Clang version 3.4-1ubuntu3):
clang++ -O3 -march=native -std=c++11 teest.cpp -o test
Result: test 1
- unsigned 41959360000 0.398293 sec 26.3267 GB/s
- uint64_t 41959360000 0.680954 sec 15.3986 GB/s
So, it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1
, so I change:
uint64_t size = atol(argv[1]) << 20;
to
uint64_t size = 1 << 20;
Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++
:
- unsigned 41959360000 0.509156 sec 20.5944 GB/s
- uint64_t 41959360000 0.508673 sec 20.6139 GB/s
Now, both versions are equally fast. However, the unsigned
got even slower! It dropped from 26
to 20 GB/s
, thus replacing a non-constant by a constant value lead to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++
with the new version:
- unsigned 41959360000 0.677009 sec 15.4884 GB/s
- uint64_t 41959360000 0.676909 sec 15.4906 GB/s
Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant by a constant value even lead to slow code in both cases for Clang!
I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.
More madness, please!
Take the first example (the one with atol(argv[1])
) and put a static
before the variable, i.e.:
static uint64_t size=atol(argv[1])<<20;
Here are my results in g++:
- unsigned 41959360000 0.396728 sec 26.4306 GB/s
- uint64_t 41959360000 0.509484 sec 20.5811 GB/s
Yay, yet another alternative. We still have the fast 26 GB/s with u32
, but we managed to get u64
at least from the 13 GB/s to the 20 GB/s version! On my collegue's PC, the u64
version became even faster than the u32
version, yielding the fastest result of all. Sadly, this only works for g++
, clang++
does not seem to care about static
.
My question
Can you explain these results? Especially:
- How can there be such a difference between
u32
andu64
? - How can replacing a non-constant by a constant buffer size trigger less optimal code?
- How can the insertion of the
static
keyword make theu64
loop faster? Even faster than the original code on my collegue's computer!
I know that optimization is a tricky territory, however, I never thought that such small changes can lead to a 100% difference in execution time and that small factors like a constant buffer size can again mix results totally. Of course, I always want to have the version that is able to popcount 26 GB/s. The only reliable way I can think of is copy paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with most performance?
The Disassembly
Here is the disassembly for the various results:
26 GB/s version from g++ / u32 / non-const bufsize:
0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8
13 GB/s version from g++ / u64 / non-const bufsize:
0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00
15 GB/s version from clang++ / u64 / non-const bufsize:
0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50
20 GB/s version from g++ / u32&u64 / const bufsize:
0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68
15 GB/s version from clang++ / u32&u64 / const bufsize:
0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0
Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea
. Some versions use jb
to jump, others use jne
. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version looks even very short and good. Can anyone explain this?
Lessons learned
No matter what the answer to this question will be; I have learned that in really hot loops every detail can matter, even details that do not seem to have any association to the hot code. I have never thought about what type to use for a loop variable, but as you see such a minor change can make a 100% difference! Even the storage type of a buffer can make a huge difference, as we saw with the insertion of the static
keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.
The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.
Culprit: False Data Dependency (and the compiler isn't even aware of it)
On Sandy/Ivy Bridge and Haswell processors, the instruction:
popcnt src, dest
appears to have a false dependency on the destination register dest
. Even though the instruction only writes to it, the instruction will wait until dest
is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake)
Skylake fixed this for lzcnt
and tzcnt
.
Cannon Lake (and Ice Lake) fixed this for popcnt
.bsf
/bsr
have a true output dependency: output unmodified for input=0. (But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.)
(Yes, these instructions all run on the same execution unit).
This dependency doesn't just hold up the 4 popcnt
s from a single loop iteration. It can carry across loop iterations making it impossible for the processor to parallelize different loop iterations.
The unsigned
vs. uint64_t
and other tweaks don't directly affect the problem. But they influence the register allocator which assigns the registers to the variables.
In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do.
- 13 GB/s has a chain:
popcnt
-add
-popcnt
-popcnt
→ next iteration - 15 GB/s has a chain:
popcnt
-add
-popcnt
-add
→ next iteration - 20 GB/s has a chain:
popcnt
-popcnt
→ next iteration - 26 GB/s has a chain:
popcnt
-popcnt
→ next iteration
The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.
To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count
variable to break all other dependencies that might mess with the benchmarks.
Here are the results:
Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)
- GCC 4.6.3:
g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
- Ubuntu 12
Different Registers: 18.6195 GB/s
.L4:
movq (%rbx,%rax,8), %r8
movq 8(%rbx,%rax,8), %r9
movq 16(%rbx,%rax,8), %r10
movq 24(%rbx,%rax,8), %r11
addq $4, %rax
popcnt %r8, %r8
add %r8, %rdx
popcnt %r9, %r9
add %r9, %rcx
popcnt %r10, %r10
add %r10, %rdi
popcnt %r11, %r11
add %r11, %rsi
cmpq $131072, %rax
jne .L4
Same Register: 8.49272 GB/s
.L9:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# This time reuse "rax" for all the popcnts.
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L9
Same Register with broken chain: 17.8869 GB/s
.L14:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# Reuse "rax" for all the popcnts.
xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L14
So what went wrong with the compiler?
It seems that neither GCC nor Visual Studio are aware that popcnt
has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.
popcnt
isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.
(Update: As of version 4.9.2, GCC is aware of this false-dependency and generates code to compensate it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
Why does the CPU have such a false dependency?
We can speculate: it runs on the same execution unit as bsf
/ bsr
which do have an output dependency. (How is POPCNT implemented in hardware?). For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.
Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.
AMD processors do not appear to have this false dependency.
The full test code is below for reference:
#include <iostream>
#include <chrono>
#include <x86intrin.h>
int main(int argc, char* argv[]) {
using namespace std;
uint64_t size=1<<20;
uint64_t* buffer = new uint64_t[size/8];
char* charbuffer=reinterpret_cast<char*>(buffer);
for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;
uint64_t count,duration;
chrono::time_point<chrono::system_clock> startP,endP;
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for (uint64_t i=0;i<size/8;i+=4) {
uint64_t r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(
"popcnt %4, %4 \n\t"
"add %4, %0 \n\t"
"popcnt %5, %5 \n\t"
"add %5, %1 \n\t"
"popcnt %6, %6 \n\t"
"add %6, %2 \n\t"
"popcnt %7, %7 \n\t"
"add %7, %3 \n\t"
: "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
);
}
}
count = c0 + c1 + c2 + c3;
endP = chrono::system_clock::now();
duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for (uint64_t i=0;i<size/8;i+=4) {
uint64_t r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(
"popcnt %4, %%rax \n\t"
"add %%rax, %0 \n\t"
"popcnt %5, %%rax \n\t"
"add %%rax, %1 \n\t"
"popcnt %6, %%rax \n\t"
"add %%rax, %2 \n\t"
"popcnt %7, %%rax \n\t"
"add %%rax, %3 \n\t"
: "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
: "rax"
);
}
}
count = c0 + c1 + c2 + c3;
endP = chrono::system_clock::now();
duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for (uint64_t i=0;i<size/8;i+=4) {
uint64_t r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(
"xor %%rax, %%rax \n\t" // <--- Break the chain.
"popcnt %4, %%rax \n\t"
"add %%rax, %0 \n\t"
"popcnt %5, %%rax \n\t"
"add %%rax, %1 \n\t"
"popcnt %6, %%rax \n\t"
"add %%rax, %2 \n\t"
"popcnt %7, %%rax \n\t"
"add %%rax, %3 \n\t"
: "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
: "rax"
);
}
}
count = c0 + c1 + c2 + c3;
endP = chrono::system_clock::now();
duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
free(charbuffer);
}
An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si
This benchmark varies the number of popcnt
s that are in the (false) dependency chain.
False Chain 0: 41959360000 0.57748 sec 18.1578 GB/s
False Chain 1: 41959360000 0.585398 sec 17.9122 GB/s
False Chain 2: 41959360000 0.645483 sec 16.2448 GB/s
False Chain 3: 41959360000 0.929718 sec 11.2784 GB/s
False Chain 4: 41959360000 1.23572 sec 8.48557 GB/s
I coded up an equivalent C program to experiment, and I can confirm this strange behaviour. What's more, gcc
believes the 64-bit integer (which should probably be a size_t
anyway...) to be better, as using uint_fast32_t
causes gcc to use a 64-bit uint.
I did a bit of mucking around with the assembly:
Simply take the 32-bit version, replace all 32-bit instructions/registers with the 64-bit version in the inner popcount-loop of the program. Observation: the code is just as fast as the 32-bit version!
This is obviously a hack, as the size of the variable isn't really 64 bit, as other parts of the program still use the 32-bit version, but as long as the inner popcount-loop dominates performance, this is a good start.
I then copied the inner loop code from the 32-bit version of the program, hacked it up to be 64 bit, fiddled with the registers to make it a replacement for the inner loop of the 64-bit version. This code also runs as fast as the 32-bit version.
My conclusion is that this is bad instruction scheduling by the compiler, not actual speed/latency advantage of 32-bit instructions.
(Caveat: I hacked up assembly, could have broken something without noticing. I don't think so.)