What does `std::kill_dependency` do, and why would I want to use it?

I've been reading about the new C++11 memory model and I've come upon the std::kill_dependency function (§29.3/14-15). I'm struggling to understand why I would ever want to use it.

I found an example in the N2664 proposal but it didn't help much.

It starts by showing code without std::kill_dependency. Here, the first line carries a dependency into the second, which carries a dependency into the indexing operation, and then carries a dependency into the do_something_with function.

r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[r2]);

There is further example that uses std::kill_dependency to break the dependency between the second line and the indexing.

r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[std::kill_dependency(r2)]);

As far as I can tell, this means that the indexing and the call to do_something_with are not dependency ordered before the second line. According to N2664:

This allows the compiler to reorder the call to do_something_with, for example, by performing speculative optimizations that predict the value of a[r2].

In order to make the call to do_something_with the value a[r2] is needed. If, hypothetically, the compiler "knows" that the array is filled with zeros, it can optimize that call to do_something_with(0); and reorder this call relative to the other two instructions as it pleases. It could produce any of:

// 1
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(0);
// 2
r1 = x.load(memory_order_consume);
do_something_with(0);
r2 = r1->index;
// 3
do_something_with(0);
r1 = x.load(memory_order_consume);
r2 = r1->index;

Is my understanding correct?

If do_something_with synchronizes with another thread by some other means, what does this mean with respect to the ordering of the x.load call and this other thread?

Assuming my understading is correct, there's still one thing that bugs me: when I'm writing code, what reasons would lead me to choose to kill a dependency?


The purpose of memory_order_consume is to ensure the compiler does not do certain unfortunate optimizations that may break lockless algorithms. For example, consider this code:

int t;
volatile int a, b;

t = *x;
a = t;
b = t;

A conforming compiler may transform this into:

a = *x;
b = *x;

Thus, a may not equal b. It may also do:

t2 = *x;
// use t2 somewhere
// later
t = *x;
a = t2;
b = t;

By using load(memory_order_consume), we require that uses of the value being loaded not be moved prior to the point of use. In other words,

t = x.load(memory_order_consume);
a = t;
b = t;
assert(a == b); // always true

The standard document considers a case where you may only be interested in ordering certain fields of a structure. The example is:

r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[std::kill_dependency(r2)]);

This instructs the compiler that it is allowed to, effectively, do this:

predicted_r2 = x->index; // unordered load
r1 = x; // ordered load
r2 = r1->index;
do_something_with(a[predicted_r2]); // may be faster than waiting for r2's value to be available

Or even this:

predicted_r2 = x->index; // unordered load
predicted_a  = a[predicted_r2]; // get the CPU loading it early on
r1 = x; // ordered load
r2 = r1->index; // ordered load
do_something_with(predicted_a);

If the compiler knows that do_something_with won't change the result of the loads for r1 or r2, then it can even hoist it all the way up:

do_something_with(a[x->index]); // completely unordered
r1 = x; // ordered
r2 = r1->index; // ordered

This allows the compiler a little more freedom in its optimization.


In addition to the other answer, I will point out that Scott Meyers, one of the definitive leaders in the C++ community, bashed memory_order_consume pretty strongly. He basically said that he believed it had no place in the standard. He said there are two cases where memory_order_consume has any effect:

  • Exotic architectures designed to support 1024+ core shared memory machines.
  • The DEC Alpha

Yes, once again, the DEC Alpha finds its way into infamy by using an optimization not seen in any other chip until many years later on absurdly specialized machines.

The particular optimization is that those processors allow one to dereference a field before actually getting the address of that field (i.e. it can look up x->y BEFORE it even looks up x, using a predicted value of x). It then goes back and determines whether x was the value it expected it to be. On success, it saved time. On failure, it has to go back and get x->y again.

Memory_order_consume tells the compiler/architecture that these operations have to happen in order. However, in the most useful case, one will end up wanting to do (x->y.z), where z doesn't change. memory_order_consume would force the compiler to keep x y and z in order. kill_dependency(x->y).z tells the compiler/architecture that it may resume doing such nefarious reorderings.

99.999% of developers will probably never work on a platform where this feature is required (or has any effect at all).