C++ OpenMP Parallel For Loop - Alternatives to std::vector [closed]
I think you can use std::vector
with OpenMP most of the time and still have good performance. The following code for example fills std::vectors
in parallel and then combines them in the end. As long as your main loop/fill function is the bottleneck this should work well in general and be thread safe.
std::vector<int> vec;
#pragma omp parallel
{
std::vector<int> vec_private;
#pragma omp for nowait //fill vec_private in parallel
for(int i=0; i<100; i++) {
vec_private.push_back(i);
}
#pragma omp critical
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
Edit:
OpenMP 4.0 allows user-defined reductions using #pragma omp declare reduction
. The code above can be simplified with to
#pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
std::vector<int> vec;
#pragma omp parallel for reduction(merge: vec)
for(int i=0; i<100; i++) vec.push_back(i);
Edit: What I have shown so far does not fill the vector in order. If the order matters then this can be done like this
std::vector<int> vec;
#pragma omp parallel
{
std::vector<int> vec_private;
#pragma omp for nowait schedule(static)
for(int i=0; i<N; i++) {
vec_private.push_back(i);
}
#pragma omp for schedule(static) ordered
for(int i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
}
This avoids saving a std::vector for each thread and then merging them in serial outside of the parallel region. I learned about this "trick" here. I'm not sure how to do this (or if it's even possible) for user-defined reductions.. It's not possible to do this with user-defined reductions.
I just realized that the critical section is not necessary which I figured out from this question parallel-cumulative-prefix-sums-in-openmp-communicating-values-between-thread. This method also gets the order correct as well
std::vector<int> vec;
size_t *prefix;
#pragma omp parallel
{
int ithread = omp_get_thread_num();
int nthreads = omp_get_num_threads();
#pragma omp single
{
prefix = new size_t[nthreads+1];
prefix[0] = 0;
}
std::vector<int> vec_private;
#pragma omp for schedule(static) nowait
for(int i=0; i<100; i++) {
vec_private.push_back(i);
}
prefix[ithread+1] = vec_private.size();
#pragma omp barrier
#pragma omp single
{
for(int i=1; i<(nthreads+1); i++) prefix[i] += prefix[i-1];
vec.resize(vec.size() + prefix[nthreads]);
}
std::copy(vec_private.begin(), vec_private.end(), vec.begin() + prefix[ithread]);
}
delete[] prefix;
The question you link was talking about the fact that "that STL vector container is not thread-safe in the situation where multiple threads write to a single container" . This is only true, as stated correctly there, if you call methods that can cause reallocation of the underlying array that std::vector
holds. push_back()
, pop_back()
and insert()
are examples of these dangerous methods.
If you need thread safe reallocation, then the library intel thread building block offers you concurrent vector containers . You should not use tbb::concurrent_vector in single thread programs because the time it takes to access random elements is higher than the time std::vector takes to do the same (which is O(1)). However, concurrent vector calls push_back()
, pop_back()
, insert()
in a thread safe way, even when reallocation happens.
EDIT 1: The slides 46 and 47 of the following Intel presentation give an illustrative example of concurrent reallocation using tbb::concurrent_vector
EDIT 2: By the way, if you start using Intel Tread Building Block (it is open source, it works with most compilers and it is much better integrated with C++/C++11 features than openmp), then you don't need to use openmp to create a parallel_for, Here is a nice example of parallel_for using tbb.