Difference between execution policies and when to use them

What is the difference between seq and par/par_unseq?

std::for_each(std::execution::seq, std::begin(v), std::end(v), function_call);

std::execution::seq stands for sequenced execution. It forces the implementation to execute all function calls in sequence, and it is guaranteed that everything is executed by the calling thread. It behaves much like the overload without an execution policy, with one notable difference: with any execution policy (including seq), an exception escaping a function call results in std::terminate, whereas without a policy the exception propagates to the caller.

In contrast, std::execution::par and std::execution::par_unseq permit parallel execution. By passing one of them, you promise that all invocations of the given function can safely be executed in parallel, i.e., without introducing data races. The implementation is allowed to use a parallel implementation, though it is not forced to do so.

What is the difference between par and par_unseq?

par_unseq demands stronger guarantees from your code than par, but in return allows additional optimizations. Specifically, par_unseq requires that the implementation be allowed to interleave the execution of multiple function calls within the same thread (unsequenced execution), which par does not permit.

Let us illustrate the difference with an example. Suppose you want to parallelize this loop:

std::vector<int> v = { 1, 2, 3 };
int sum = 0;
std::for_each(std::execution::seq, std::begin(v), std::end(v), [&](int i) {
  sum += i*i;
});

You cannot directly parallelize the code above, as it would introduce a data race on the sum variable. To avoid that, you can introduce a lock:

int sum = 0;
std::mutex m;
std::for_each(std::execution::par, std::begin(v), std::end(v), [&](int i) {
  std::lock_guard<std::mutex> lock{m};
  sum += i*i;
});

Now all function calls can safely be executed in parallel, and the code will not break when you switch to par. But what would happen if you used par_unseq instead, where a single thread may execute multiple function calls interleaved rather than one after another?

It can result in a deadlock, for instance, if the code is reordered like this:

 m.lock();    // iteration 1 (constructor of std::lock_guard)
 m.lock();    // iteration 2
 sum += ...;  // iteration 1
 sum += ...;  // iteration 2
 m.unlock();  // iteration 1 (destructor of std::lock_guard)
 m.unlock();  // iteration 2

The standard's term for such code is vectorization-unsafe. To quote from P0024R2:

A standard library function is vectorization-unsafe if it is specified to synchronize with another function invocation, or another function invocation is specified to synchronize with it, and if it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions may not be invoked by user code called from parallel_vector_execution_policy algorithms.

One way to make the code above vectorization-safe is to replace the mutex with an atomic:

std::atomic<int> sum{0};
std::for_each(std::execution::par_unseq, std::begin(v), std::end(v), [&](int i) {
  sum.fetch_add(i*i, std::memory_order_relaxed);
});

What are the advantages of using par_unseq over par?

The additional optimizations that an implementation can use in par_unseq mode include vectorized execution and migrations of work across threads (the latter is relevant if task parallelism is used with a parent-stealing scheduler).

If vectorization is allowed, implementations can internally use SIMD parallelism (Single-Instruction, Multiple-Data). For example, OpenMP supports it via #pragma omp simd annotations, which can help compilers to generate better code.
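As a sketch of what such an annotation looks like (assuming an OpenMP-enabled compiler; the pragma is only a hint and is ignored when OpenMP SIMD support is off, so the loop stays correct either way):

```cpp
#include <cstddef>

// The pragma asks the compiler to vectorize the loop with SIMD
// instructions; each iteration is independent, so this is safe.
void scale(float* out, const float* in, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}
```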

When should I prefer std::execution::seq?

Prefer sequential execution when any of the following applies:

  1. correctness (avoiding data races)
  2. avoiding parallel overhead (startup costs and synchronization)
  3. simplicity (debugging)

It is not uncommon that data dependencies will enforce sequential execution. In other words, use sequential execution if parallel execution would add data races.

Rewriting and tuning the code for parallel execution is not always trivial. Unless it is a critical part of your application, you can start with a sequential version and optimize later. You also might want to avoid parallel execution if you are executing the code in a shared environment where you need to be conservative in resource usage.

Parallelism also does not come for free. If the expected total execution time of the loop is very low, sequential execution will most likely be the best even from a pure performance perspective. The bigger the data and the more costly each computation step is, the less important the synchronization overhead will be.

For instance, using parallelism in the example above would not make sense, as the vector only contains three elements and the operations are very cheap. Also note that the original version - before the introduction of mutexes or atomics - contained no synchronization overhead. A common mistake in measuring speedup of a parallel algorithm is to use a parallel version running on one CPU as the baseline. Instead, you should always compare against an optimized sequential implementation without the synchronization overhead.

When should I prefer std::execution::par_unseq?

First, make sure that it does not sacrifice correctness:

  • If there are data races when executing steps in parallel by different threads, par_unseq is not an option.
  • If the code is vectorization-unsafe, for instance, because it acquires a lock, par_unseq is not an option (but par might be).

Otherwise, use par_unseq if it is a performance-critical part of your application and par_unseq actually improves performance over seq.

When should I prefer std::execution::par?

If the steps can safely be executed in parallel by different threads, but you cannot use par_unseq because the code is vectorization-unsafe, it is a candidate for par.

As with par_unseq, verify that it is a performance-critical part of your application and that par is an actual performance improvement over seq.

Sources:

  • cppreference.com (execution policy)
  • P0024R2: The Parallelism TS Should be Standardized

seq means "execute sequentially" and behaves essentially the same as the version without an execution policy. The main difference is that with any execution policy, an exception escaping f results in std::terminate, whereas without a policy it propagates to the caller.

par means "execute in parallel", which permits the implementation to execute on multiple threads in parallel. You are responsible for making sure that no data races happen within f.

par_unseq means that in addition to being allowed to execute in multiple threads, the implementation is also allowed to interleave individual loop iterations within a single thread, i.e. load multiple elements and execute f on all of them only afterwards. This is required to permit a vectorized implementation.