Why is a simple loop optimized when the limit is 959 but not 960?
Consider this simple loop:
float f(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 959; i++)
        p += 1;
    return p;
}
If you compile with gcc 7 (snapshot) or clang (trunk) with -march=core-avx2 -Ofast, you get something very similar to this:
.LCPI0_0:
.long 1148190720 # float 960
f: # @f
vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
ret
In other words, it just sets the answer to 960 without looping.
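In effect, the whole function has been folded at compile time into something like the following C (an illustrative equivalent, not actual compiler output):

float f(float x[]) {
    (void)x;           /* the argument is never used */
    return 960.0f;     /* 1.0f incremented 959 times */
}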
However, if you change the code to:
float f(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 960; i++)
        p += 1;
    return p;
}
the produced assembly actually performs the loop sum. For example, clang gives:
.LCPI0_0:
.long 1065353216 # float 1
.LCPI0_1:
.long 1086324736 # float 6
f: # @f
vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
vxorps ymm1, ymm1, ymm1
mov eax, 960
vbroadcastss ymm2, dword ptr [rip + .LCPI0_1]
vxorps ymm3, ymm3, ymm3
vxorps ymm4, ymm4, ymm4
.LBB0_1: # =>This Inner Loop Header: Depth=1
vaddps ymm0, ymm0, ymm2
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm2
vaddps ymm4, ymm4, ymm2
add eax, -192
jne .LBB0_1
vaddps ymm0, ymm1, ymm0
vaddps ymm0, ymm3, ymm0
vaddps ymm0, ymm4, ymm0
vextractf128 xmm1, ymm0, 1
vaddps ymm0, ymm0, ymm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddps ymm0, ymm0, ymm1
vhaddps ymm0, ymm0, ymm0
vzeroupper
ret
Why is this the case, and why is the cutoff exactly the same for clang and gcc?
If you replace float with double, the limit for the same loop is 479. Again, this is the same for gcc and clang.
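For reference, the double variant in question looks like this (a sketch based on the description above; the function name g is just for illustration):

double g(double x[]) {
    double p = 1.0;
    for (int i = 0; i < 479; i++)   /* 479 iterations: folded to the constant 480.0 */
        p += 1;                     /* with 480 or more, the loop is emitted */
    return p;
}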
Update 1
It turns out that gcc 7 (snapshot) and clang (trunk) behave very differently. clang optimizes out the loop for all limits below 960, as far as I can tell. gcc, on the other hand, is sensitive to the exact value and does not have an upper limit. For example, it does not optimize out the loop when the limit is 200 (as well as many other values), but it does when the limit is 202 or 20002 (as well as many other values).
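To reproduce the observation, one can put two otherwise identical functions side by side and inspect the assembly (hypothetical names f200 and f202; per the behaviour described above, gcc keeps the loop in the first but folds the second to a constant):

float f200(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 200; i++)   /* gcc 7 snapshot: loop is kept */
        p += 1;
    return p;
}

float f202(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 202; i++)   /* gcc 7 snapshot: folded to 203.0 */
        p += 1;
    return p;
}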
Solution 1:
TL;DR
By default, the current snapshot of GCC 7 behaves inconsistently, while previous versions have a default limit of 16 due to PARAM_MAX_COMPLETELY_PEEL_TIMES, which can be overridden from the command line.
The rationale for the limit is to prevent overly aggressive loop unrolling, which can be a double-edged sword.
GCC version <= 6.3.0
The relevant optimization option for GCC is -fpeel-loops, which is enabled indirectly along with the flag -Ofast (emphasis is mine):
Peels loops for which there is enough information that they do not roll much (from profile feedback or static analysis). It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations).
Enabled with -O3 and/or -fprofile-use.
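As an illustration of "complete removal of loops with small constant number of iterations": a loop whose trip count is below the default peeling limit of 16 (discussed below) should be removed entirely. A hypothetical sketch, assuming the same -Ofast options:

float small_sum(float x[]) {
    float p = 1.0f;
    for (int i = 0; i < 10; i++)   /* 10 < 16, so the loop is completely peeled */
        p += 1;
    return p;                      /* folds to the constant 11.0f */
}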
More details can be obtained by adding -fdump-tree-cunroll:
$ head test.c.151t.cunroll
;; Function f (f, funcdef_no=0, decl_uid=1919, cgraph_uid=0, symbol_order=0)
Not peeling: upper bound is known so can unroll completely
The message is from /gcc/tree-ssa-loop-ivcanon.c:
  if (maxiter >= 0 && maxiter <= npeel)
    {
      if (dump_file)
        fprintf (dump_file, "Not peeling: upper bound is known so can "
                 "unroll completely\n");
      return false;
    }
hence the try_peel_loop function returns false.
More verbose output can be obtained with -fdump-tree-cunroll-details:
Loop 1 iterates 959 times.
Loop 1 iterates at most 959 times.
Not unrolling loop 1 (--param max-completely-peeled-times limit reached).
Not peeling: upper bound is known so can unroll completely
It is possible to tweak the limits by playing with the max-completely-peeled-insns=n and max-completely-peel-times=n parameters:
max-completely-peeled-insns
The maximum number of insns of a completely peeled loop.
max-completely-peel-times
The maximum number of iterations of a loop to be suitable for complete peeling.
To learn more about insns, you can refer to the GCC Internals Manual.
For instance, if you compile with the following options:
-march=core-avx2 -Ofast --param max-completely-peeled-insns=1000 --param max-completely-peel-times=1000
then the code turns into:
f:
vmovss xmm0, DWORD PTR .LC0[rip]
ret
.LC0:
.long 1148207104
Clang
I am not sure what Clang actually does or how to tweak its limits, but as I observed, you can force it to evaluate the final value by marking the loop with an unroll pragma, after which it removes the loop completely:
#pragma unroll
for (int i = 0; i < 960; i++)
    p++;
results in:
.LCPI0_0:
.long 1148207104 # float 961
f: # @f
vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
ret
Solution 2:
After reading Sulthan's comment, I guess that:
1. The compiler fully unrolls the loop if the loop counter is constant (and not too high).
2. Once it is unrolled, the compiler sees that the sum operations can be grouped into one.
3. If the loop isn't unrolled for some reason (here: it would generate too many statements with 1000), the operations cannot be grouped.
The compiler could see that unrolling into 1000 statements amounts to a single addition, but steps 1 and 2 described above are two separate optimizations, so it cannot take the "risk" of unrolling without knowing whether the operations can be grouped (for example, a function call cannot be grouped).
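An illustrative sketch of the two steps (not actual compiler output):

/* Step 1: complete unrolling flattens the loop into repeated additions. */
float f_unrolled(float x[]) {
    float p = 1.0f;
    p += 1; p += 1; p += 1; p += 1;   /* ...repeated 960 times in total */
    /* Step 2: constant folding then groups these into p = 961.0f. */
    return p;
}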
Note: this is a corner case: who uses a loop to add the same thing over and over again? In that case, don't rely on the compiler's possible unrolling/optimisation; write the proper operation directly in one statement.
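For instance, something along these lines expresses the intent directly without relying on the optimizer (a trivial sketch):

float f(float x[]) {
    float p = 1.0f;
    p += 960;      /* add once instead of looping 960 times */
    return p;
}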
Solution 3:
Very good question!
You seem to have hit a limit on the number of iterations or operations the compiler tries to inline when simplifying the code. As documented by Grzegorz Szpetkowski, there are compiler-specific ways to tweak these limits with pragmas or command-line options.
You can also play with Godbolt's Compiler Explorer to compare how different compilers and options impact the generated code: gcc 6.2 and icc 17 still inline the code for 960, whereas clang 3.9 does not (with the default Godbolt configuration, it actually stops inlining at 73).