Why do C++ optimizers have problems with these temporary variables or rather why `v[]` should be avoided in tight loops?

In this code snippet, I'm comparing performance of two functionally identical loops:

for (int i = 1; i < v.size()-1; ++i) {
  int a = v[i-1];
  int b = v[i];
  int c = v[i+1];

  if (a < b  &&  b < c)
    ++n;
}

and

for (int i = 1; i < v.size()-1; ++i) 
  if (v[i-1] < v[i]  &&  v[i] < v[i+1])
    ++n;

The first one runs significantly slower than the second one across a number of different C++ compilers with optimization flag set to O2:

  • second loop is about 330% slower now with Clang 3.7.0
  • second loop is about 2% slower with gcc 4.9.3
  • second loop is about 2% slower with Visual C++ 2015

I'm puzzled that modern C++ optimizers have problems handling this case. Any clues why? Do I have to write ugly code without using temporary variables in order to get the best performance?

Using temporary variables makes the code faster, sometimes dramatically, now. What is going on?

The full code I'm using is provided below:

#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
  int n = 0;

  for (int i = 1; i < v.size()-1; ++i) {
    int a = v[i-1];
    int b = v[i];
    int c = v[i+1];

    if (a < b  &&  b < c)
      ++n;
  }

  return n;
}


int f1()
{
  int n = 0;

  for (int i = 1; i < v.size()-1; ++i) 
    if (v[i-1] < v[i]  &&  v[i] < v[i+1])
      ++n;

  return n;
}


int main()
{
  auto benchmark = [](int (*f)()) {
    const int N = 100;

    volatile long long result = 0;
    vector<long long>  timings(N);

    for (int i = 0; i < N; ++i) {
      auto t0 = high_resolution_clock::now(); 
      result += f();
      auto t1 = high_resolution_clock::now(); 

      timings[i] = duration_cast<nanoseconds>(t1-t0).count();
    }

    sort(timings.begin(), timings.end());
    cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
    cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
  };

  mt19937                    generator   (31415);   // deterministic seed
  uniform_int_distribution<> distribution(0, 1023);

  for (auto& e: v) 
    e = distribution(generator);

  benchmark(f0);
  benchmark(f1);

  cout << "\ndone\n";

  return 0;
}

It seems like the compiler lacks knowledge about the relationship between std::vector<>::size() and internal vector buffer size. Consider std::vector being our custom bugged_vector vector-like object with slight bug - its ::size() can sometimes be one more than internal buffer size n, but only then v[n-2] >= v[n-1].

Then two snippets have different semantics again: first one has undefined behavior, as we access element v[v.size() - 1]. The second one, however, doesn't have: due to short-circuit nature of &&, we don't ever read v[v.size() - 1] on the last iteration.

So, if compiler can't prove that our v is not a bugged_vector, it must short-circuit, which introduce additional jump in a machine code.

By looking at assembly output from clang, we can see that it actually happens.

From the Godbolt Compiler Explorer, with clang 3.7.0 -O2, the loop in f0 is:

### f0: just the loop
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
    mov     edi, ecx
    cmp     edx, edi
    setl    r10b
    mov     ecx, dword ptr [r8 + 4*rsi + 4]
    lea     rsi, [rsi + 1]
    cmp     edi, ecx
    setl    dl
    and     dl, r10b
    movzx   edx, dl
    add     eax, edx
    cmp     rsi, r9
    mov     edx, edi
    jb      .LBB1_2

And for f1:

### f1: just the loop
.LBB2_2:                                # =>This Inner Loop Header: Depth=1
    mov     esi, r10d
    mov     r10d, dword ptr [r9 + 4*rdi]
    lea     rcx, [rdi + 1]
    cmp     esi, r10d
    jge     .LBB2_4                     # <== This is Extra Jump
    cmp     r10d, dword ptr [r9 + 4*rdi + 4]
    setl    dl
    movzx   edx, dl
    add     eax, edx
.LBB2_4:                                # %._crit_edge.3
    cmp     rcx, r8
    mov     rdi, rcx
    jb      .LBB2_2

I've pointed out the extra jump in f1. And as we (hopefuly) know, conditional jumps in a tight loops are bad for performance. (See the performance guides in the x86 tag wiki for details.)

GCC and Visual Studio are aware that std::vector is well-behaved, and produce almost identical assembly for both snippets. Edit. It turns out clang does better job optimizing the code. All three compilers can't prove that it is safe to read v[i + 1] prior to comparison in the second example (or choose not to), but only clang manages to optimize the first example with the additional information that reading v[i + 1] is either valid or UB.

A performance difference of 2% is negligible can be explained by different order or choice of some instructions.


Here's additional insight to expand on @deniss' answer, which correctly diagnosed the issue.

Incidentally, this is related to the most popular C++ Q&A of all time "Why is processing a sorted array faster than an unsorted array?".

The main issue is the compiler must honor the logical AND operator (&&) and not load from v[i+1] unless the first condition is true. This is a consequence of the semantics of the Logical AND operator as well as the tightened memory model semantics introduced with C++11, the relevant clauses in the draft of the standard are

5.14 Logical AND operator [expr.log.and]

Unlike &, && guarantees left-to-right evaluation: the second operand is not evaluated if the first operand is false.
ISO C++14 Standard (draft N3797)

and for speculative reads

1.10 Multi-threaded executions and data races [intro.multithread]

23 [ Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. — end note ]
ISO C++14 Standard (draft N3797)

My guess is optimizers play it safe and currently choose not to issue speculative loads to potentially shared memory rather than special case for each target processor whether the speculative load could introduce a detectable data race for that target.

In order to implement this, the compiler generates a conditional branch. Usually this isn't noticeable because modern processors have very sophisticated branch prediction, and the misprediction rate is typically very low. However the data here is random - this kills branch prediction. The cost of a misprediction is 10 to 20 CPU cycles, considering that the CPU is typically retiring 2 instructions per cycle this is equivalent to 20 to 40 instructions. If the prediction rate is 50% (random) then every iteration has a mispredict penalty equivalent to 10 to 20 instructions - HUGE.

Note: The compiler could prove that elements v[0] to v[v.size()-2] will be referenced, in that order, regardless of the values they contain. This would allow the compiler in this case to generate code that unconditionally loads all but the last element of the vector. The last element of the vector, at v[v.size()-1], may only be loaded in the last iteration of the loop and only if the first condition is true. The compiler could therefore generate code for the loop without the short circuit branch up until the last iteration, then use different code with the short circuit branch for the last iteration - that would require the compiler knowing that the data is random and branch prediction is useless and therefore that it is worth bothering with that - compilers aren't that sophisticated - yet.

To avoid the conditional branch generated by the Logical AND (&&) and avoid loading the memory locations into local variables we can change the Logical AND operator into a Bitwise AND, code snippet here, the result is almost 4x faster when the data is random

int f2()
{
  int n = 0;

  for (int i = 1; i < v.size()-1; ++i) 
     n += (v[i-1] < v[i])  &  (v[i] < v[i+1]); // Bitwise AND

  return n;
}

Output

3.642443ms min
3.779982ms median
Result: 166634

3.725968ms min
3.870808ms median
Result: 166634

1.052786ms min
1.081085ms median
Result: 166634


done

The result on gcc 5.3 is 8x faster (live in Coliru here)

g++ --version
g++ -std=c++14  -O3 -Wall -Wextra -pedantic -pthread -pedantic-errors main.cpp -lm  && ./a.out
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

3.761290ms min
4.025739ms median
Result: 166634

3.823133ms min
4.050742ms median
Result: 166634

0.459393ms min
0.505011ms median
Result: 166634


done

You might wonder how the compiler can evaluate the comparison v[i-1] < v[i] without generating a conditional branch. The answer depends on the target, for x86 this is possible because of the SETcc instruction, which generates a one byte result, 0 or 1, depending on a condition in the EFLAGS register, the same condition that could be used in a conditional branch, but without branching. In the generated code given by @deniss you can see setl generated, that sets the result to 1 if the condition "less than" is met, which is evaluated by the previous compare instruction:

cmp     edx, edi       ; a < b ?
setl    r10b           ; r10b = a < b ? 1 : 0
mov     ecx, dword ptr [r8 + 4*rsi + 4] ; c = v[i+1]
lea     rsi, [rsi + 1] ; ++i
cmp     edi, ecx       ; b < c ?
setl    dl             ; dl = b < c ? 1 : 0
and     dl, r10b       ; dl &= r10b
movzx   edx, dl        ; edx = zero extended dl
add     eax, edx       ; n += edx

f0 and f1 are semantically different.

x() && y() involves a short-circuit in the case of x() being false as we know. This means that if x() is false , then y() must not be evaluated.

This prevents prefetching of the data in order to evaluate y() and (at least on clang) is causing the insertion of a conditional jump, which is resulting in branch-predictor misses.

Adding another 2 tests proves the point.

#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
    int n = 0;

    for (int i = 1; i < v.size()-1; ++i) {
        int a = v[i-1];
        int b = v[i];
        int c = v[i+1];

        if (a < b  &&  b < c)
            ++n;
    }

    return n;
}


int f1()
{
    int n = 0;

    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
        if (v[i-1] < v[i]  &&  v[i] < v[i+1])
            ++n;

    return n;
}

int f2()
{
    int n = 0;

    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
    {
        auto t1 = v[i-1] < v[i];
        auto t2 = v[i] < v[i+1];
        if (t1 && t2)
            ++n;
    }

    return n;
}

int f3()
{
    int n = 0;

    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
    {
        n += 1 * (v[i-1] < v[i]) * (v[i] < v[i+1]);
    }

    return n;
}



int main()
{
    auto benchmark = [](int (*f)()) {
        const int N = 100;

        volatile long long result = 0;
        vector<long long>  timings(N);

        for (int i = 0; i < N; ++i) {
            auto t0 = high_resolution_clock::now();
            result += f();
            auto t1 = high_resolution_clock::now();

            timings[i] = duration_cast<nanoseconds>(t1-t0).count();
        }

        sort(timings.begin(), timings.end());
        cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
        cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
    };

    mt19937                    generator   (31415);   // deterministic seed
    uniform_int_distribution<> distribution(0, 1023);

    for (auto& e: v) 
        e = distribution(generator);

    benchmark(f0);
    benchmark(f1);
    benchmark(f2);
    benchmark(f3);

    cout << "\ndone\n";

    return 0;
}

results (apple clang, -O2):

1.233948ms min
1.320545ms median
Result: 166850

3.366751ms min
3.493069ms median
Result: 166850

1.261948ms min
1.361748ms median
Result: 166850

1.251434ms min
1.353653ms median
Result: 166850