Multithreading program stuck in optimized mode but runs normally in -O0
I wrote a simple multithreading program as follows:
#include <iostream>
#include <future>
#include <chrono>
#include <thread>

static bool finished = false;

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result = std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished = true;
    std::cout << "result =" << result.get();
    std::cout << "\nmain thread id=" << std::this_thread::get_id() << std::endl;
}
It behaves normally in debug mode in Visual Studio, or with -O0 in gcc, and prints the result after 1 second. But it gets stuck and does not print anything in Release mode, or with -O1, -O2, or -O3.
Two threads accessing a non-atomic, unguarded variable is undefined behaviour. This concerns finished. You could make finished of type std::atomic<bool> to fix this.
My fix:
#include <iostream>
#include <future>
#include <atomic>
static std::atomic<bool> finished = false;
int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result = std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished = true;
    std::cout << "result =" << result.get();
    std::cout << "\nmain thread id=" << std::this_thread::get_id() << std::endl;
}
Output:
result =1023045342
main thread id=140147660588864
Live Demo on coliru
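A side note on the fix: for a standalone stop flag like this, even the weakest memory ordering is sufficient, because the flag does not publish any other data (and result.get() on the future already synchronises the returned value). A minimal sketch, assuming the same func and finished names as above; the stop() helper is a name I made up:

```cpp
#include <atomic>
#include <cstddef>

static std::atomic<bool> finished{false};

// relaxed is enough for a standalone stop flag: the loop only needs to
// see the store eventually; no other data is published through the flag
int func()
{
    std::size_t i = 0;
    while (!finished.load(std::memory_order_relaxed))
        ++i;
    return static_cast<int>(i);
}

// hypothetical helper: requests termination of func()'s loop
void stop()
{
    finished.store(true, std::memory_order_relaxed);
}
```

Whether this micro-optimisation is worth it is debatable; the plain finished = true in the fix above defaults to sequentially consistent ordering, which is the safest choice.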
Somebody may think 'It's a bool – probably one bit. How can this be non-atomic?' (I did when I started with multi-threading myself.)
But note that lack-of-tearing is not the only thing that std::atomic gives you. It also makes concurrent read+write access from multiple threads well-defined, stopping the compiler from assuming that re-reading the variable will always see the same value.
Making a bool unguarded and non-atomic can cause additional issues:
- The compiler might decide to optimize the variable into a register, or even CSE multiple accesses into one and hoist a load out of a loop.
- The variable might be cached for a CPU core. (In real life, CPUs have coherent caches. This is not a real problem, but the C++ standard is loose enough to cover hypothetical C++ implementations on non-coherent shared memory where atomic<bool> with memory_order_relaxed store/load would work, but where volatile wouldn't. Using volatile for this would be UB, even though it works in practice on real C++ implementations.)
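To illustrate the first point, here is a sketch (in source form, with a function name I invented) of the hoisting transformation the optimiser is allowed to perform on the question's plain-bool loop:

```cpp
#include <cstddef>

static bool finished = false;  // plain, non-atomic bool, as in the question

// func_after_hoist is a hypothetical source-level view of what the
// optimiser may do: since a data-race-free program cannot change
// `finished` concurrently, the load is hoisted out of the loop.
int func_after_hoist()
{
    std::size_t i = 0;
    bool f = finished;  // single load, hoisted out of the loop
    while (!f)          // the condition never re-reads memory,
        ++i;            // so a false value on entry loops forever
    return static_cast<int>(i);
}
```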
To prevent this from happening, the compiler must be told explicitly not to perform such optimizations.
I'm a little bit surprised about the evolving discussion concerning the potential relation of volatile to this issue. Thus, I'd like to spend my two cents:
- Is volatile useful with threads
- Who's afraid of a big bad optimizing compiler?
Scheff's answer describes how to fix your code. I thought I would add a little information on what is actually happening in this case.
I compiled your code at godbolt using optimisation level 1 (-O1). Your function compiles like so:
func():
cmp BYTE PTR finished[rip], 0
jne .L4
.L5:
jmp .L5
.L4:
mov eax, 0
ret
So, what is happening here?
First, we have a comparison: cmp BYTE PTR finished[rip], 0 - this checks to see if finished is false or not.
If it is not false (i.e. true), we should exit the loop on the first run. This is accomplished by jne .L4, which jumps when not equal to label .L4, where the value of i (0) is stored in a register for later use and the function returns.
If it is false however, we move to
.L5:
jmp .L5
This is an unconditional jump to label .L5, which just so happens to be the jump command itself.
In other words, the thread is put into an infinite busy loop.
So why has this happened?
As far as the optimiser is concerned, threads are outside of its purview. It assumes other threads aren't reading or writing variables simultaneously (because that would be data-race UB). You need to tell it that it cannot optimise accesses away. This is where Scheff's answer comes in. I won't bother to repeat him.
Because the optimiser is not told that the finished variable may potentially change during execution of the function, it sees that finished is not modified by the function itself and assumes that it is constant.
The optimised code provides the two code paths that will result from entering the function with a constant bool value: either it runs the loop infinitely, or the loop is never run.
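Translated back into source form, the -O1 assembly above behaves like the following sketch (the function name is mine; the trivial infinite loop is itself not well-defined C++ and is shown only to mirror the assembly):

```cpp
static bool finished = false;  // plain bool, as in the question

// func_as_compiled mirrors the -O1 assembly: one load on entry
// selects between an immediate return and an infinite busy loop.
int func_as_compiled()
{
    if (finished)   // cmp BYTE PTR finished[rip], 0 / jne .L4
        return 0;   // .L4: mov eax, 0 / ret
    for (;;) { }    // .L5: jmp .L5
}
```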
At -O0 the compiler (as expected) does not optimise the loop body and comparison away:
func():
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], 0
.L148:
movzx eax, BYTE PTR finished[rip]
test al, al
jne .L147
add QWORD PTR [rbp-8], 1
jmp .L148
.L147:
mov rax, QWORD PTR [rbp-8]
pop rbp
ret
Therefore the function, when unoptimised, does work. The lack of atomicity here is typically not a problem, because the code and data type are simple. Probably the worst we could run into here is a value of i that is off by one from what it should be.
A more complex system with data structures is far more likely to result in corrupted data or improper execution.
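As a sketch of why: imagine two fields with the invariant that they always change together. The Point class below is my own illustration, not from the question; without the mutex (or a single atomic), a concurrent reader could observe x already updated but y still old, a state no single writer ever produced.

```cpp
#include <mutex>
#include <utility>

// Hypothetical pair of coordinates with the invariant that x and y
// are always updated together.
struct Point {
    int x = 0, y = 0;
    std::mutex m;

    void move_to(int nx, int ny)
    {
        std::lock_guard<std::mutex> lock(m);  // both writes form one unit
        x = nx;
        y = ny;
    }

    std::pair<int, int> read()
    {
        std::lock_guard<std::mutex> lock(m);  // both reads see one state
        return {x, y};
    }
};
```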
For the sake of completeness in the learning curve: you should avoid using global variables. You did a good job, though, by making it static, so it is local to the translation unit.
Here is an example:
class ST {
public:
    int func()
    {
        size_t i = 0;
        while (!finished)
            ++i;
        return i;
    }

    void setFinished(bool val)
    {
        finished = val;
    }

private:
    std::atomic<bool> finished = false;
};

int main()
{
    ST st;
    auto result = std::async(std::launch::async, &ST::func, std::ref(st));
    std::this_thread::sleep_for(std::chrono::seconds(1));
    st.setFinished(true);
    std::cout << "result =" << result.get();
    std::cout << "\nmain thread id=" << std::this_thread::get_id() << std::endl;
}
Live on wandbox