gcc optimization flag -O3 makes code slower than -O2
Solution 1:
gcc -O3
uses a cmov for the conditional, so it lengthens the loop-carried dependency chain to include a cmov
(which is 2 uops and 2 cycles of latency on your Intel Sandybridge CPU, according to Agner Fog's instruction tables. See also the x86 tag wiki). This is one of the cases where cmov
sucks.
If the data was even moderately unpredictable, cmov
would probably be a win, so this is a fairly sensible choice for a compiler to make. (However, compilers may sometimes use branchless code too much.)
I put your code on the Godbolt compiler explorer to see the asm (with nice highlighting and filtering out irrelevant lines. You still have to scroll down past all the sort code to get to main(), though).
.L82: # the inner loop from gcc -O3
movsx rcx, DWORD PTR [rdx] # sign-extending load of data[c]
mov rsi, rcx
add rcx, rbx # rcx = sum+data[c]
cmp esi, 127
cmovg rbx, rcx # sum = data[c]>127 ? rcx : sum
add rdx, 4 # pointer-increment
cmp r12, rdx
jne .L82
gcc could have saved the MOV by using LEA instead of ADD.
The loop bottlenecks on the latency of ADD->CMOV (3 cycles), since one iteration of the loop writes rbx with CMO, and the next iteration reads rbx with ADD.
The loop only contains 8 fused-domain uops, so it can issue at one per 2 cycles. Execution-port pressure is also not as bad a bottleneck as the latency of the sum
dep chain, but it's close (Sandybridge only has 3 ALU ports, unlike Haswell's 4).
BTW, writing it as sum += (data[c] >= 128 ? data[c] : 0);
to take the cmov
out of the loop-carried dep chain is potentially useful. Still lots of instructions, but the cmov
in each iteration is independent. This compiles as expected in gcc6.3 -O2
and earlier, but gcc7 de-optimizes into a cmov
on the critical path (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666). (It also auto-vectorizes with earlier gcc versions than the if()
way of writing it.)
Clang takes the cmov off the critical path even with the original source.
gcc -O2
uses a branch (for gcc5.x and older), which predicts well because your data is sorted. Since modern CPUs use branch-prediction to handle control dependencies, the loop-carried dependency chain is shorter: just an add
(1 cycle latency).
The compare-and-branch in every iteration is independent, thanks to branch-prediction + speculative execution, which lets execution continue before the branch direction is known for sure.
.L83: # The inner loop from gcc -O2
movsx rcx, DWORD PTR [rdx] # load with sign-extension from int32 to int64
cmp ecx, 127
jle .L82 # conditional-jump over the next instruction
add rbp, rcx # sum+=data[c]
.L82:
add rdx, 4
cmp rbx, rdx
jne .L83
There are two loop-carried dependency chains: sum
and the loop-counter. sum
is 0 or 1 cycle long, and the loop-counter is always 1 cycle long. However, the loop is 5 fused-domain uops on Sandybridge, so it can't execute at 1c per iteration anyway, so latency isn't a bottleneck.
It probably runs at about one iteration per 2 cycles (bottlenecked on branch instruction throughput), vs. one per 3 cycles for the -O3 loop. The next bottleneck would be ALU uop throughput: 4 ALU uops (in the not-taken case) but only 3 ALU ports. (ADD can run on any port).
This pipeline-analysis prediction matches pretty much exactly with your timings of ~3 sec for -O3 vs. ~2 sec for -O2.
Haswell/Skylake could run the not-taken case at one per 1.25 cycles, since it can execute a not-taken branch in the same cycle as a taken branch and has 4 ALU ports. (Or slightly less since a 5 uop loop doesn't quite issue at 4 uops every cycle).
(Just tested: Skylake @ 3.9GHz runs the branchy version of the whole program in 1.45s, or the branchless version in 1.68s. So the difference is much smaller there.)
g++6.3.1 uses cmov
even at -O2
, but g++5.4 still behaves like 4.9.2.
With both g++6.3.1 and g++5.4, using -fprofile-generate
/ -fprofile-use
produces the branchy version even at -O3
(with -fno-tree-vectorize
).
The CMOV version of the loop from newer gcc uses add ecx,-128
/ cmovge rbx,rdx
instead of CMP/CMOV. That's kinda weird, but probably doesn't slow it down. ADD writes an output register as well as flags, so creates more pressure on the number of physical registers. But as long as that's not a bottleneck, it should be about equal.
Newer gcc auto-vectorizes the loop with -O3, which is a significant speedup even with just SSE2. (e.g. my i7-6700k Skylake runs the vectorized version
in 0.74s, so about twice as fast as scalar. Or -O3 -march=native
in 0.35s, using AVX2 256b vectors).
The vectorized version looks like a lot of instructions, but it's not too bad, and most of them aren't part of a loop-carried dep chain. It only has to unpack to 64-bit elements near the end. It does pcmpgtd
twice, though, because it doesn't realize it could just zero-extend instead of sign-extend when the condition has already zeroed all negative integers.