Empty loop is slower than a non-empty one in C

Assuming that your code uses a 32-bit int type (which your system probably does), nothing can be determined from your code. Instead, it exhibits undefined behavior: (1<<31) shifts a 1 into the sign bit of a signed int. Clang diagnoses both this and the wrong type of main's first parameter:

foo.c:5:5: error: first parameter of 'main' (argument count) must be of type 'int'
int main (char argc, char * argv[]) {
    ^
foo.c:13:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow]
    for (i = 0; i<(1<<31)-1; i++);
                         ^
foo.c:19:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow]
    for (i = 0; i<(1<<31)-1; i++) {
                         ^

Let's try to fix that:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <limits.h>

int main (int argc, char * argv[]) {
    clock_t begin, end;
    uint64_t i;
    double total_time, free_time;
    int A = 1;
    int B = 1;

    begin = clock();
    for (i = 0; i<INT_MAX; i++);
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    begin = clock();
    for (i = 0; i<INT_MAX; i++) {
        A += B%2;
    }
    end = clock();
    total_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", total_time);

    return 0;
}

Now, let's look at the assembly output of this code. Personally, I find LLVM's internal assembly very readable, so I'm going to show that. I'll produce it by running:

clang -O3 foo.c -S -emit-llvm -std=gnu99

Here's the relevant portion of the output (the main function):

define i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 {
  %1 = tail call i64 @"\01_clock"() #3
  %2 = tail call i64 @"\01_clock"() #3
  %3 = sub nsw i64 %2, %1
  %4 = sitofp i64 %3 to double
  %5 = fdiv double %4, 1.000000e+06
  %6 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %5) #3
  %7 = tail call i64 @"\01_clock"() #3
  %8 = tail call i64 @"\01_clock"() #3
  %9 = sub nsw i64 %8, %7
  %10 = sitofp i64 %9 to double
  %11 = fdiv double %10, 1.000000e+06
  %12 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %11) #3
  ret i32 0
}

Note that there are no operations between the calls to clock() in either case: since i and A are never used after the loops, the optimizer removes both loops entirely. So the empty and the non-empty version are compiled to the exact same thing.


The fact is that modern processors are complicated. All the instructions executed interact with each other in complicated and interesting ways. Thanks to "that other guy" for posting the code.

Both OP and "that other guy" apparently found that the short loop takes 11 cycles, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time even though there are a lot of operations. For the short loop, there must be some stall caused by it being so short, and just adding a nop makes the loop long enough to avoid the stall.

Here is what happens if we look at the code:

0x00000000004005af <+50>:    addq   $0x1,-0x20(%rbp)
0x00000000004005b4 <+55>:    cmpq   $0x7fffffff,-0x20(%rbp)
0x00000000004005bc <+63>:    jb     0x4005af <main+50>

We read i and write it back (addq). We read it immediately again, and compare it (cmpq). And then we loop. But the loop uses branch prediction. So at the time when the addq is executed, the processor isn't really sure it is allowed to write to i (because branch prediction could be wrong).

Then we compare against i. The processor will try to avoid reading i from memory, because that read takes a long time. Instead, a bit of hardware remembers that we just wrote to i, and the cmpq instruction gets the data directly from the pending store (this is called store-to-load forwarding). Unfortunately, we are not sure at this point whether the write to i has actually happened or not! So that could introduce a stall here.

The problem here is that the conditional jump, the addq which leads to a conditional store, and the cmpq which isn't sure where to get its data from, are all very, very close together. They are unusually close together. It could be that they are so close together that the processor cannot figure out in time whether to take i from the store instruction or to read it from memory. In that case it reads i from memory, which is slower because it has to wait for the store to finish. Adding just one nop gives the processor enough time.

Usually you think of memory as just RAM plus cache. On a modern Intel processor, a read can be served from (slowest to fastest):

  1. Memory (RAM)
  2. L3 cache (optional)
  3. L2 cache
  4. L1 cache
  5. A previous store instruction that hasn't been written to the L1 cache yet (the store buffer).

So what the processor does internally in the short, slow loop:

  1. Read i from L1 cache
  2. Add 1 to i
  3. Write i to L1 cache
  4. Wait until i is written to L1 cache
  5. Read i from L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less.

In the long, fast loop the processor does:

  1. Lots of stuff
  2. Read i from L1 cache
  3. Add 1 to i
  4. Do a "store" instruction that will write i to L1 cache
  5. Read i directly from the "store" instruction without touching L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less.