Loop with function call faster than an empty loop

I linked some assembly with some c to test the cost of a function call, with the following assembly and c source (using fasm and gcc respectively)

assembly:

format ELF

public no_call as "_no_call"
public normal_call as "_normal_call"

section '.text' executable

iter equ 100000000

no_call:
    mov ecx, iter
@@:
    push ecx
    pop ecx
    dec ecx
    cmp ecx, 0
    jne @b
    ret

normal_function:
    ret

normal_call:
    mov ecx, iter
@@:
    push ecx
    call normal_function
    pop ecx
    dec ecx
    cmp ecx, 0
    jne @b
    ret

c source:

#include <stdio.h>
#include <time.h>

extern int no_call();
extern int normal_call();

int main()
{
    clock_t ct1, ct2;

    ct1 = clock();
    no_call();
    ct2 = clock();
    printf("\n\n%d\n", ct2 - ct1);

    ct1 = clock();
    normal_call();
    ct2 = clock();
    printf("%d\n", ct2 - ct1);

    return 0;
}

The results I got were suprising. First of all, the speed depended on the order in which I linked mattered. If I linked as gcc intern.o extern.o, a typical output is

162
181

But linking in the opposite order gcc extern.o intern.o, I got an output more like:

162
130

That they are different was very suprising but is not the question I am asking. (relevant question here)

The question I am asking is how is it that in the second run the loop with the function call was faster than the loop without one, how was the cost of calling a function apparently negative.

Edit: Just to make mention of some of the things tried in the comments:

  • In the compiled bytecode the function calls were not optimized away.
  • Adjusting the alignment of the functions and loops to be on everything from 4 to 64 byte boundaries did not speed up no_call, though some alignments did slow down normal_call
  • Giving the CPU/OS a chance to warm up by calling the functions multiple times rather than just once had no noticable effect of the lengths of times measured, neither does changing the order of the calls or running seperately
  • Running for longer times does not affect the ratio, for example running 1000 times longer I got 162.168 and 131.578 seconds for my run times

Additionally, after modifying the assembly code to align on bytes, I tested giving the set of functions an additional offset and came to some more strange conclusions. Here is the updated code:

format ELF

public no_call as "_no_call"
public normal_call as "_normal_call"

section '.text' executable

iter equ 100000000

offset equ 23 ; this is the number I am changing
times offset nop

times 16 nop
no_call:
    mov ecx, iter
no_call.loop_start:
    push ecx
    pop ecx
    dec ecx
    cmp ecx, 0
    jne no_call.loop_start
    ret

times 55 nop
normal_function:
    ret


times 58 nop
normal_call:
    mov ecx, iter
normal_call.loop_start:
    push ecx
    call normal_function
    pop ecx
    dec ecx
    cmp ecx, 0
    jne normal_call.loop_start
    ret

I had to manually (and non portably) force the 64 byte alignment since FASM does not support more than 4 byte alignment for the executable section, at least on my machine. Offsetting the program by offset bytes, here is what I found.

if (20 <= offset mod 128 <= 31) then we get an output of (approximately):

162
131

else

162 (+/- 10)
162 (+/- 10)

Not sure at all what to make of it, but that's what I've discovered so far

Edit 2:

Another thing I noticed is that if you remove push ecx and pop ecx from both functions, the output becomes

30
125

which indicates that that is the most expensive part of it. The stack alignment is the same both times, so that is not the reason for the discrepency. My best guess is that somehow the hardware is optimized to expect a call after a push or something similar, but I do not know of anything like that


Solution 1:

Update: Skylake store/reload latency is as low as 3c, but only if the timing is right. Consecutive loads involved in a store-forwarding dependency chain that are naturally spaced out by 3 or more cycles will experience the faster latency (e.g. with 4 imul eax,eax in the loop, mov [rdi], eax / mov eax, [rdi] only takes the cycle count up from 12 to 15 cycles per iteration.) but when the loads are allow to execute more densely than that, some type of contention is suffered and you get about 4.5 cycles per iteration. The non-integer average throughput is also a big clue that there's something unusual.

I saw the same effect for 32B vectors (best case 6.0c, back-to-back 6.2 to 6.9c), but 128b vectors were always around 5.0c. See details on Agner Fog's forum.

Update2: Adding a redundant assignment speeds up code when compiled without optimization and a 2013 blog post indicate that this effect is present on all Sandybridge-family CPUs.

The back-to-back (worst case) store-forwarding latency on Skylake is 1 cycle better than on previous uarches, but the variability when the load can't execute right away is similar.


With the right (mis-)alignment, the extra call in the loop can actually help Skylake observe lower store-forwarding latency from push to pop. I was able to reproduce this with perf counters (Linux perf stat -r4), using YASM. (I've heard it's less convenient to use perf counters on Windows, and I don't have a Windows dev machine anyway. Fortunately the OS isn't really relevant to the answer; anyone should be able to reproduce my perf-counter results on Windows with VTune or something.)

I saw the faster times at offset = 0..10, 37, 63-74, 101, and 127 following an align 128 at the spot specified in the question. L1I cache lines are 64B, and the uop-cache is cares about 32B boundaries. It looks alignment relative to a 64B boundary is all that matters.

The no-call loop is a steady 5 cycles always, but the call loop can get down to 4c per iteration from its usual almost-exactly-5 cycles. I saw slower-than-usual performance at offset=38 (5.68 +- 8.3% cycles per iteration). There are small glitches at other points, like 5.17c +- 3.3%, according to perf stat -r4 (which does 4 runs and averaging).

It seems to be an interaction between the front-end not queueing up so many uops ahead, causing the back end to have lower latency for store-forwarding from push to pop.

IDK if reusing the same address repeatedly for store-forwarding makes it slower (with multiple store-address uops already executed ahead of the corresponding store-data uops), or what.


Test code: bash shell loop to build & profile the asm with every different offset:

(set -x; for off in {0..127};do 
    asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=$off && 
    ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults:u,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,dsb2mite_switches.penalty_cycles -r4 ./call-tight-loop;
done ) |& tee -a call-tight-loop.call.offset-log

(set -x) in a subshell is a handy way to log commands along with their output when redirecting to a log file.

asm-link is a script that runs yasm -felf32 -Worphan-labels -gdwarf2 call-tight-loop.asm "$@" && ld -melf_i386 -o call-tight-loop call-tight-loop.o, then runs objdumps -drwC -Mintel on the result.

NASM / YASM Linux test program (assembles into a complete static binary that runs the loop and then exits, so you can profile the whole program.) Direct port of the OP's FASM source, with no optimizations to the asm.

CPU p6    ; YASM directive.  For NASM, %use smartalign.
section .text
iter equ 100000000

%ifndef OFFSET
%define OFFSET 0
%endif

align 128
;;offset equ 23 ; this is the number I am changing
times OFFSET nop

times 16 nop
no_call:
    mov ecx, iter
.loop:
    push ecx
    pop ecx
    dec ecx
    cmp ecx, 0
    jne .loop
    ret

times 55 nop
normal_function:
    ret

times 58 nop
normal_call:
    mov ecx, iter
.loop:
    push ecx
    call normal_function
    pop ecx
    dec ecx
    cmp ecx, 0
    jne .loop
    ret

%ifndef FUNC
%define FUNC no_call
%endif

align 64
global _start
_start:
    call FUNC

    mov eax,1             ; __NR_exit from /usr/include/asm/unistd_32.h
    xor ebx,ebx
    int 0x80              ; sys_exit(0), 32-bit ABI

Sample output from a fast call run:

+ asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=3
...

080480d8 <normal_function>:
 80480d8:       c3                      ret    
...

08048113 <normal_call>:
 8048113:       b9 00 e1 f5 05          mov    ecx,0x5f5e100
08048118 <normal_call.loop>:
 8048118:       51                      push   ecx
 8048119:       e8 ba ff ff ff          call   80480d8 <normal_function>
 804811e:       59                      pop    ecx
 804811f:       49                      dec    ecx
 8048120:       83 f9 00                cmp    ecx,0x0
 8048123:       75 f3                   jne    8048118 <normal_call.loop>
 8048125:       c3                      ret    

 ...

 Performance counter stats for './call-tight-loop' (4 runs):

    100.646932      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.97% )
             0      context-switches          #    0.002 K/sec                    ( +-100.00% )
             0      cpu-migrations            #    0.000 K/sec                  
             1      page-faults:u             #    0.010 K/sec                  
   414,143,323      cycles                    #    4.115 GHz                      ( +-  0.56% )
   700,193,469      instructions              #    1.69  insn per cycle           ( +-  0.00% )
   700,293,232      uops_issued_any           # 6957.919 M/sec                    ( +-  0.00% )
 1,000,299,201      uops_executed_thread      # 9938.695 M/sec                    ( +-  0.00% )
    83,212,779      idq_mite_uops             #  826.779 M/sec                    ( +- 17.02% )
         5,792      dsb2mite_switches_penalty_cycles #    0.058 M/sec                    ( +- 33.07% )

   0.100805233 seconds time elapsed                                          ( +-  0.96% )

Old answer before noticing the variable store-forwarding latency

You push/pop your loop counter, so everything except the call and ret instructions (and the cmp/jcc) are part of the critical path loop-carried dependency chain involving the loop counter.

You'd expect that pop would have to wait for updates to the stack pointer by call/ret, but the stack engine handles those updates with zero latency. (Intel since Pentium-M, AMD since K10, according to Agner Fog's microarch pdf, so I'm assuming your CPU has one, even though you didn't say anything about what CPU microarchitecture you ran your tests on.)

The extra call/ret still need to execute, but out-of-order execution can keep the critical path instructions running at their max throughput. Since this includes the latency of a store->load forwarding from push/pop + 1 cycle for dec, this is not high throughput on any CPU, and it's a surprise that the front-end can ever be a bottleneck with any alignment.

push->pop latency is 5 cycles on Skylake, according to Agner Fog, so on that uarch your loop can only run at best one iteration per 6 cycles. This is plenty of time for out-of-order-execution to run the call and ret instructions. Agner lists a max throughput for call of one per 3 cycles, and ret at one per 1 cycle. Or on AMD Bulldozer, 2 and 2. His tables don't list anything about the throughput of a call/ret pair, so IDK whether those can overlap or not. On AMD Bulldozer, store/reload latency with mov is 8 cycles. I assume it's about the same with push/pop.

It seems that different alignments for the top of the loop (i.e. no_call.loop_start:) are causing front-end bottlenecks. The call version has 3 branches per iteration: the call, the ret, and the loop-branch. Note that the ret's branch target is the instruction right after the call. Each of these potentially disrupts the front-end. Since you're seeing an actual slow-down in practice, we must be seeing more than 1 cycle delay per branch. Or for the no_call version, a single fetch/decode bubble worse than about 6 cycles, leading to an actual wasted cycle in issuing uops into the out-of-order part of the core. That's weird.

It's too complicated to guess about what the actual microarchitectural details are for every possible uarch, so let us know what CPU you tested on.

I will mention though that push/pop inside a loop on Skylake stops it from issuing from the Loop Stream Detector, and has to be re-fetched from the uop cache every time. Intel's optimization manual says that for Sandybridge, a mismatched push/pop inside a loop stops it from using the LSD. That implies that it can use the LSD for loops with balanced push/pop. In my testing, that's not the case on Skylake (using the lsd.uops performance counter), but I haven't seen any mention of whether that was a change, or whether SnB was actually like that, too.

Also, unconditional branches always end a uop-cache line. It's possible that with normal_function: in the same naturally-aligned 32B chunk of machine code as the call and jne, maybe the block of code doesn't fit in the uop cache. (Only 3 uop-cache lines can cache decoded uops for a single 32B chunk of x86 code). But that wouldn't explain the possibility of problems for the no_call loop, so you're probably not running on an Intel SnB-family microarchitecture.

(Update, yes, the loop sometimes runs mostly from legacy decode (idq.mite_uops), but usually not exclusively. dsb2mite_switches.penalty_cycles is usually ~8k, and probably only happens on timer interrupts. The runs where the call loop runs faster seem to be correlated with lower idq.mite_uops, but it's still 34M +- 63% for the offset=37 case where the 100M iterations took 401M cycles.)

This is really one of those "don't do that" cases: inline tiny functions instead of calling them from inside very tight loops.


You might see different results if you push/pop a register other than your loop counter. That would separate the push/pop from the loop counter, so there'd be 2 separate dependency chains. It should speed up both the call and no_call versions, but maybe no equally. It could just make a front-end bottleneck more obvious.

You should see a huge speedup if you push edx but pop eax, so the push/pop instructions don't form a loop-carried dependency chain. Then the extra call/ret would definitely be a bottleneck.


Side-note: dec ecx already sets ZF the way you want, so you could have just used dec ecx / jnz. Also, cmp ecx,0 is less efficient than test ecx,ecx (larger code size and can't macro-fuse on as many CPUs). Anyway, totally irrelevant to the question about relative performance of your two loops. (Your lack of an ALIGN directive between functions means that changing the first one would have changed the alignment of the loop branch in the 2nd, but you already explored different alignments.)