gcc optimization flag -O3 makes code slower than -O2

Solution 1:

gcc -O3 uses a cmov for the conditional, so it lengthens the loop-carried dependency chain to include a cmov (which is 2 uops and 2 cycles of latency on your Intel Sandybridge CPU, according to Agner Fog's instruction tables; see also the x86 tag wiki). This is one of the cases where cmov sucks.

If the data was even moderately unpredictable, cmov would probably be a win, so this is a fairly sensible choice for a compiler to make. (However, compilers may sometimes use branchless code too much.)

I put your code on the Godbolt compiler explorer to see the asm (with nice highlighting and filtering out irrelevant lines. You still have to scroll down past all the sort code to get to main(), though).
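For reference, the loop being compiled is essentially this (reconstructed from the asm comments below; the names and bounds are placeholders, not the exact original source):

    // conditionally accumulate: only elements >= 128 are added to sum
    for (size_t c = 0; c < arraySize; ++c)
        if (data[c] >= 128)
            sum += data[c];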

.L82:  # the inner loop from gcc -O3
    movsx   rcx, DWORD PTR [rdx]  # sign-extending load of data[c]
    mov     rsi, rcx
    add     rcx, rbx        # rcx = sum+data[c]
    cmp     esi, 127
    cmovg   rbx, rcx        # sum = data[c]>127 ? rcx : sum
    add     rdx, 4          # pointer-increment
    cmp     r12, rdx
    jne     .L82

gcc could have saved the MOV by using LEA instead of ADD: LEA writes its result to a different register, so data[c] would still be available in rcx for the compare and the separate mov rsi, rcx wouldn't be needed.

The loop bottlenecks on the latency of ADD->CMOV (3 cycles total), since one iteration of the loop writes rbx with CMOV, and the next iteration reads rbx with ADD.

The loop only contains 8 fused-domain uops, so it can issue at one iteration per 2 cycles. Execution-port pressure is also not as bad a bottleneck as the latency of the sum dep chain, but it's close (Sandybridge only has 3 ALU ports, unlike Haswell's 4).

BTW, writing it as sum += (data[c] >= 128 ? data[c] : 0); to take the cmov off the loop-carried dep chain is potentially useful. Still lots of instructions, but the cmov in each iteration is independent of sum; only the final add is loop-carried. This compiles as expected with gcc6.3 -O2 and earlier, but gcc7 de-optimizes it into a cmov on the critical path (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666). (It also auto-vectorizes with earlier gcc versions than the if() way of writing it does.)
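To make the difference between the two source forms concrete (a sketch only, with placeholder names):

    // original form: the cmov ends up selecting the new value of sum itself,
    // so it sits on the loop-carried dependency chain (add + cmov = 3 cycles)
    if (data[c] >= 128)
        sum += data[c];

    // rewritten form: selecting 0 vs. data[c] doesn't depend on sum;
    // only the final add is loop-carried (1 cycle)
    sum += (data[c] >= 128 ? data[c] : 0);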

Clang takes the cmov off the critical path even with the original source.


gcc -O2 uses a branch (for gcc5.x and older), which predicts well because your data is sorted. Since modern CPUs use branch-prediction to handle control dependencies, the loop-carried dependency chain is shorter: just an add (1 cycle latency).

The compare-and-branch in every iteration is independent, thanks to branch-prediction + speculative execution, which lets execution continue before the branch direction is known for sure.

.L83:   # The inner loop from gcc -O2
    movsx   rcx, DWORD PTR [rdx]  # load with sign-extension from int32 to int64
    cmp     ecx, 127
    jle     .L82        # conditional-jump over the next instruction 
    add     rbp, rcx    # sum+=data[c]
.L82:
    add     rdx, 4
    cmp     rbx, rdx
    jne     .L83

There are two loop-carried dependency chains: sum and the loop-counter. The sum chain is 0 or 1 cycles per iteration (0 when the add is skipped), and the loop-counter chain is always 1 cycle. However, the loop is 5 fused-domain uops on Sandybridge, so it can't run at 1 cycle per iteration anyway (5 uops / 4-wide issue ≈ 1.25 cycles per iteration at best), so latency isn't the bottleneck.

It probably runs at about one iteration per 2 cycles, bottlenecked on branch instruction throughput (two branches per iteration, and Sandybridge only has one port that executes branches), vs. one per 3 cycles for the -O3 loop. The next bottleneck would be ALU uop throughput: 4 ALU uops (in the not-taken case) but only 3 ALU ports. (ADD can run on any port.)

This pipeline-analysis prediction matches your timings of ~3 sec for -O3 vs. ~2 sec for -O2 pretty much exactly.


Haswell/Skylake could run the not-taken case at one per 1.25 cycles, since it can execute a not-taken branch in the same cycle as a taken branch and has 4 ALU ports. (Or slightly slower, since a 5-uop loop doesn't quite sustain an issue rate of 4 uops every cycle.)

(Just tested: Skylake @ 3.9GHz runs the branchy version of the whole program in 1.45s, or the branchless version in 1.68s. So the difference is much smaller there.)


g++6.3.1 uses cmov even at -O2, but g++5.4 still behaves like 4.9.2.

With both g++6.3.1 and g++5.4, using -fprofile-generate / -fprofile-use produces the branchy version even at -O3 (with -fno-tree-vectorize).

The CMOV version of the loop from newer gcc uses add ecx,-128 / cmovge rbx,rdx instead of CMP/CMOV. That's kinda weird, but probably doesn't slow it down. ADD writes an output register as well as flags, so it creates more pressure on the number of physical registers. But as long as that's not a bottleneck, it should be about equal.


Newer gcc auto-vectorizes the loop with -O3, which is a significant speedup even with just SSE2. (e.g. my i7-6700k Skylake runs the vectorized version in 0.74s, so about twice as fast as scalar. Or -O3 -march=native in 0.35s, using AVX2 256b vectors).

The vectorized version looks like a lot of instructions, but it's not too bad, and most of them aren't part of a loop-carried dep chain. It only has to unpack to 64-bit elements near the end. It does pcmpgtd twice, though, because it doesn't realize it could just zero-extend instead of sign-extend when the condition has already zeroed all negative integers.
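For illustration, here's a hand-written SSE2 sketch of that zero-extend idea (this is not gcc's actual output, and the function name/signature are made up; it assumes n is a multiple of 4): do the pcmpgtd once, AND away the elements that are too small, then widen to 64-bit by interleaving with zero, since the surviving values can't be negative.

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <stddef.h>
    #include <stdint.h>

    // Sum of all data[c] >= 128, 4 ints per iteration (sketch, not gcc's codegen)
    int64_t sum_ge128(const int *data, size_t n)
    {
        const __m128i threshold = _mm_set1_epi32(127);
        const __m128i zero      = _mm_setzero_si128();
        __m128i acc_lo = zero, acc_hi = zero;              // 2x two 64-bit accumulators
        for (size_t c = 0; c < n; c += 4) {
            __m128i v    = _mm_loadu_si128((const __m128i *)(data + c));
            __m128i mask = _mm_cmpgt_epi32(v, threshold);  // one pcmpgtd: data[c] > 127
            __m128i keep = _mm_and_si128(v, mask);         // 0 or data[c], never negative
            // zero-extend 32->64 by interleaving with zero (no second pcmpgtd needed)
            acc_lo = _mm_add_epi64(acc_lo, _mm_unpacklo_epi32(keep, zero));
            acc_hi = _mm_add_epi64(acc_hi, _mm_unpackhi_epi32(keep, zero));
        }
        __m128i acc = _mm_add_epi64(acc_lo, acc_hi);
        acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));  // horizontal sum of the two halves
        return (int64_t)_mm_cvtsi128_si64(acc);
    }

The only loop-carried dependency here is the pair of paddq accumulators (1 cycle latency each), so like gcc's vectorized loop, throughput rather than latency is what limits it.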