C for loop indexing: is forward-indexing faster in new CPUs?

On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:

On CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g. for (int i=x-1; i>=0; i--) {...}) because comparing i to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g. for (int i=0; i<x; i++) {...}).

My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?

(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc... mostly I'm just curious)


Solution 1:

You're really asking about prefetching, not about loop control logic.

In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.

Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.
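To make that distinction concrete, here's a minimal sketch (my own illustration, not code from the question) of two loops that do identical arithmetic but walk an array in opposite directions. Any performance difference between them on a large array comes from the memory system, not from the loop header:

#include <stddef.h>

/* Same arithmetic, opposite address order. The i++ vs i-- in the header
   is cheap either way; what differs is the direction the loop walks
   through memory, which is what the prefetchers see. */
long sum_forward(const int *a, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; i++)        /* ascending addresses */
        s += a[i];
    return s;
}

long sum_backward(const int *a, size_t n)
{
    long s = 0;
    size_t i;
    for (i = n; i-- > 0; )         /* descending addresses */
        s += a[i];
    return s;
}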

You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.

Take a look at Intel's guide to optimizing for hardware prefetchers. There are four prefetchers listed; two for NetBurst chips:

  1. NetBurst's hardware prefetcher can detect streams of memory accesses in either forward or backward directions, and it will try to load data from those locations into the L2 cache.
  2. NetBurst also has an adjacent cache line (ACL) prefetcher, which will automatically load two adjacent cache lines when you fetch the first one.

and two for Core:

  1. Core has a slightly more sophisticated hardware prefetcher; it can detect strided access in addition to streams of contiguous references, so it'll do better if you step through an array every other element, every 4th, etc.
  2. Core also has an ACL prefetcher like NetBurst.

If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do OK making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for more sophisticated array traversals.
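As a rough illustration (my sketch, not Intel's code), a strided traversal looks like this. A NetBurst-style stream prefetcher only helps when the references are contiguous, while a stride-detecting prefetcher like Core's can also fetch ahead for stride > 1:

#include <stddef.h>

/* Touches every stride-th element. With stride == 1 this is a plain
   stream; with stride == 4 (say), a stride-detecting prefetcher can
   still predict the next address, but a purely sequential one cannot. */
long sum_strided(const int *a, size_t n, size_t stride)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; i += stride)
        s += a[i];
    return s;
}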

These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers. Some guys at Google are on that paper; performance is something that is of great concern to them.

Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.
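For a flavor of what loop tiling looks like, here's a minimal sketch (the function names and tile size are mine, and the right TILE depends on your cache sizes) comparing a naive matrix transpose with a tiled one:

#include <stddef.h>

#define TILE 32   /* pick so a TILE x TILE block of each array fits in L1 */

/* Naive transpose: dst is written with a stride of n doubles, so nearly
   every store touches a different cache line once n is large. */
void transpose_naive(const double *src, double *dst, size_t n)
{
    size_t i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: process TILE x TILE blocks so the cache lines used by
   both arrays stay resident while a block is being worked on. */
void transpose_tiled(const double *src, double *dst, size_t n)
{
    size_t ii, jj, i, j;
    for (ii = 0; ii < n; ii += TILE)
        for (jj = 0; jj < n; jj += TILE)
            for (i = ii; i < ii + TILE && i < n; i++)
                for (j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}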

I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.

Solution 2:

I don't know. But I do know how to write a quick benchmark with no guarantees of scientific validity (actually, one with rather strict guarantees of invalidity). It has interesting results:

#include <time.h>
#include <stdio.h>

int main(void)
{
    int i;
    int s;
    clock_t start_time, end_time;
    long centiseconds;

    start_time = clock();
    s = 1;
    for (i = 0; i < 1000000000; i++)
    {
        s = s + i;   /* overflow is deliberate; both loops wrap to the same answer */
    }
    end_time = clock();
    centiseconds = (long)(end_time - start_time) * 100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Forward took %ld centiseconds\n", s, centiseconds);

    start_time = clock();
    s = 1;
    for (i = 999999999; i >= 0; i--)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (long)(end_time - start_time) * 100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Backward took %ld centiseconds\n", s, centiseconds);

    return 0;
}

Compiled with -O9 using gcc 3.4.4 on Cygwin, running on an "AMD Athlon(tm) 64 Processor 3500+" (2211 MHz) in 32 bit Windows XP:

Answer is -1243309311; Forward took 93 centiseconds
Answer is -1243309311; Backward took 92 centiseconds

(Answers varied by 1 either way in several repetitions.)

Compiled with -O9 using gcc 4.4.1 running on an "Intel(R) Atom(TM) CPU N270 @ 1.60GHz" (800 MHz and presumably only one core, given the program)[1] in 32 bit Ubuntu Linux:

Answer is -1243309311; Forward took 196 centiseconds
Answer is -1243309311; Backward took 228 centiseconds

(Answers varied by 1 either way in several repetitions.)

Looking at the generated assembly, the forward loop is translated to:

; Gcc 3.4.4 on Cygwin for Athlon      ; Gcc 4.4.1 on Ubuntu for Atom
L5:                                .L2:
    addl    %eax, %ebx                 addl    %eax, %ebx
    incl    %eax                       addl    $1, %eax
    cmpl    $999999999, %eax           cmpl    $1000000000, %eax
    jle     L5                         jne     .L2

The backward to:

L9:                                .L3:
    addl    %eax, %ebx                 addl    %eax, %ebx
    decl    %eax                       subl    $1, %eax
    jns     L9                         cmpl    $-1, %eax
                                       jne     .L3

Which shows, if not much else, that GCC's behaviour has changed between those two versions!

Pasting the older GCC's loops into the newer GCC's asm file gives results of:

Answer is -1243309311; Forward took 194 centiseconds
Answer is -1243309311; Backward took 133 centiseconds

Summary: on the >5 year old Athlon, the loops generated by GCC 3.4.4 are the same speed. On the newish (<1 year?) Atom, the backward loop is significantly faster. GCC 4.4.1 has a slight regression for this particular case which I personally am not bothered about in the least, given the point of it. (I had to make sure that s is used after the loop, because otherwise the compiler would elide the computation altogether.)
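Incidentally, if you ever want to stop a compiler from eliding a benchmark loop without printing the result, one common trick (my suggestion, not part of the test above) is to store the result into a volatile object; that store is an observable side effect the optimizer must preserve:

/* The volatile store makes the loop's result observable, so the
   optimizer cannot discard the computation that produced it. */
volatile int sink;

void consume(int s)
{
    sink = s;   /* call consume(s) after each timed loop */
}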

[1] I can never remember the command for system info...