std::mutex performance compared to Win32 CRITICAL_SECTION
Solution 1:
Please see my updates at the end of the answer; the situation has changed dramatically since Visual Studio 2015. The original answer is below.
I made a very simple test, and according to my measurements std::mutex is around 50-70x slower than CRITICAL_SECTION.
std::mutex: 18140574us
CRITICAL_SECTION: 296874us
Edit: After some more tests, it turned out that it depends on the number of threads (contention) and the number of CPU cores (a short sketch for querying the core count follows the results). Generally, std::mutex is slower, but how much slower depends on the usage. Following are updated test results (tested on a MacBook Pro with a Core i5-4258U, Windows 10, Boot Camp):
Iterations: 1000000
Thread count: 1
std::mutex: 78132us
CRITICAL_SECTION: 31252us
Thread count: 2
std::mutex: 687538us
CRITICAL_SECTION: 140648us
Thread count: 4
std::mutex: 1031277us
CRITICAL_SECTION: 703180us
Thread count: 8
std::mutex: 86779418us
CRITICAL_SECTION: 1634123us
Thread count: 16
std::mutex: 172916124us
CRITICAL_SECTION: 3390895us
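Since the scaling depends on the machine, one could derive the thread counts from the hardware instead of hard-coding 16. A minimal sketch using the standard std::thread::hardware_concurrency() (my addition, not part of the original test):
#include <iostream>
#include <thread>
int main() {
    // Number of concurrent hardware threads; may return 0 when unknown.
    unsigned int cores = std::thread::hardware_concurrency();
    std::cout << "Hardware threads: " << cores << "\n";
    // The test rounds could then go from 1 up to, say, 2 * cores threads.
    return 0;
}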
The code that produced this output is below. It was compiled with Visual Studio 2012, default project settings, Win32 release configuration. Please note that this test may not be perfectly correct, but it made me think twice before switching my code from CRITICAL_SECTION to std::mutex.
#include "stdafx.h"
#include <Windows.h>
#include <mutex>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>
const int g_cRepeatCount = 1000000;
const int g_cThreadCount = 16;
double g_shmem = 8;
std::mutex g_mutex;
CRITICAL_SECTION g_critSec;
void sharedFunc( int i )
{
if ( i % 2 == 0 )
g_shmem = sqrt(g_shmem);
else
g_shmem *= g_shmem;
}
void threadFuncCritSec() {
for ( int i = 0; i < g_cRepeatCount; ++i ) {
EnterCriticalSection( &g_critSec );
sharedFunc(i);
LeaveCriticalSection( &g_critSec );
}
}
void threadFuncMutex() {
for ( int i = 0; i < g_cRepeatCount; ++i ) {
g_mutex.lock();
sharedFunc(i);
g_mutex.unlock();
}
}
void testRound(int threadCount)
{
std::vector<std::thread> threads;
auto startMutex = std::chrono::high_resolution_clock::now();
for (int i = 0; i<threadCount; ++i)
threads.push_back(std::thread( threadFuncMutex ));
for ( std::thread& thd : threads )
thd.join();
auto endMutex = std::chrono::high_resolution_clock::now();
std::cout << "std::mutex: ";
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endMutex - startMutex).count();
std::cout << "us \n\r";
threads.clear();
auto startCritSec = std::chrono::high_resolution_clock::now();
for (int i = 0; i<threadCount; ++i)
threads.push_back(std::thread( threadFuncCritSec ));
for ( std::thread& thd : threads )
thd.join();
auto endCritSec = std::chrono::high_resolution_clock::now();
std::cout << "CRITICAL_SECTION: ";
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endCritSec - startCritSec).count();
std::cout << "us \n\r";
}
int _tmain(int argc, _TCHAR* argv[]) {
InitializeCriticalSection( &g_critSec );
std::cout << "Iterations: " << g_cRepeatCount << "\n\r";
for (int i = 1; i <= g_cThreadCount; i = i*2) {
std::cout << "Thread count: " << i << "\n\r";
testRound(i);
Sleep(1000);
}
DeleteCriticalSection( &g_critSec );
// Added 10/27/2017 to try to prevent the compiler to completely
// optimize out the code around g_shmem if it wouldn't be used anywhere.
std::cout << "Shared variable value: " << g_shmem << std::endl;
getchar();
return 0;
}
Update 10/27/2017 (1):
Some answers suggest that this is not a realistic test and does not represent a "real world" scenario. That's true: this test tries to measure the overhead of std::mutex; it's not trying to prove that the difference is negligible for 99% of applications.
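For clarity, "overhead" here means the cost of an uncontended lock/unlock pair. A minimal single-threaded sketch of such a measurement (the iteration count is arbitrary; this is my illustration, not the original test):
#include <chrono>
#include <iostream>
#include <mutex>
int main() {
    std::mutex m;
    const int kIterations = 1000000;   // arbitrary
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < kIterations; ++i) {
        m.lock();     // never contended: only one thread exists
        m.unlock();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << "us for " << kIterations << " uncontended lock/unlock pairs\n";
    return 0;
}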
Update 10/27/2017 (2):
It seems the situation has changed in favor of std::mutex since Visual Studio 2015 (VC140). I used the VS2017 IDE, exactly the same code as above, the x64 release configuration with optimizations disabled, and simply switched the "Platform Toolset" for each test. The results are very surprising and I am really curious what changed in VC140.
Update 02/25/2020 (3):
I reran the test with Visual Studio 2019 (toolset v142), and the situation is still the same: std::mutex is two to three times faster than CRITICAL_SECTION.
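Side note: the benchmarks call lock()/unlock() by hand to keep the comparison symmetrical with EnterCriticalSection/LeaveCriticalSection, but in production code you would normally use a scoped guard, which releases the mutex even if the protected code throws. A minimal sketch (the function is just an illustration):
#include <mutex>
std::mutex g_mutex;
double g_shmem = 8;
void sharedUpdate(double value)
{
    // Locks on construction, unlocks automatically on scope exit,
    // including when an exception propagates out of this block.
    std::lock_guard<std::mutex> lock(g_mutex);
    g_shmem += value;
}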
Solution 2:
The test by waldez here is not realistic: it basically simulates 100% contention, which in general is exactly what you don't want in multi-threaded code. Below is a modified test which does most of its calculations on thread-local data and only briefly touches shared state. The results I get with this code are different:
Tasks: 160000
Thread count: 1
std::mutex: 12096ms
CRITICAL_SECTION: 12060ms
Thread count: 2
std::mutex: 5206ms
CRITICAL_SECTION: 5110ms
Thread count: 4
std::mutex: 2643ms
CRITICAL_SECTION: 2625ms
Thread count: 8
std::mutex: 1632ms
CRITICAL_SECTION: 1702ms
Thread count: 12
std::mutex: 1227ms
CRITICAL_SECTION: 1244ms
You can see here that for me (using VS2013) the figures are very close between std::mutex and CRITICAL_SECTION. Note that this code does a fixed number of tasks (160,000), which is why performance generally improves with more threads. I've got 12 cores here, so that's why I stopped at 12.
I'm not saying this is right or wrong compared to the other test, but it does highlight that timing issues are generally domain-specific.
#include "stdafx.h"
#include <Windows.h>
#include <mutex>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>
const int tastCount = 160000;
int numThreads;
const int MAX_THREADS = 16;
double g_shmem = 8;
std::mutex g_mutex;
CRITICAL_SECTION g_critSec;
void sharedFunc(int i, double &data)
{
for (int j = 0; j < 100; j++)
{
if (j % 2 == 0)
data = sqrt(data);
else
data *= data;
}
}
void threadFuncCritSec() {
double lMem = 8;
int iterations = tastCount / numThreads;
for (int i = 0; i < iterations; ++i) {
for (int j = 0; j < 100; j++)
sharedFunc(j, lMem);
EnterCriticalSection(&g_critSec);
sharedFunc(i, g_shmem);
LeaveCriticalSection(&g_critSec);
}
printf("results: %f\n", lMem);
}
void threadFuncMutex() {
double lMem = 8;
int iterations = tastCount / numThreads;
for (int i = 0; i < iterations; ++i) {
for (int j = 0; j < 100; j++)
sharedFunc(j, lMem);
g_mutex.lock();
sharedFunc(i, g_shmem);
g_mutex.unlock();
}
printf("results: %f\n", lMem);
}
void testRound()
{
std::vector<std::thread> threads;
auto startMutex = std::chrono::high_resolution_clock::now();
for (int i = 0; i < numThreads; ++i)
threads.push_back(std::thread(threadFuncMutex));
for (std::thread& thd : threads)
thd.join();
auto endMutex = std::chrono::high_resolution_clock::now();
std::cout << "std::mutex: ";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(endMutex - startMutex).count();
std::cout << "ms \n\r";
threads.clear();
auto startCritSec = std::chrono::high_resolution_clock::now();
for (int i = 0; i < numThreads; ++i)
threads.push_back(std::thread(threadFuncCritSec));
for (std::thread& thd : threads)
thd.join();
auto endCritSec = std::chrono::high_resolution_clock::now();
std::cout << "CRITICAL_SECTION: ";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(endCritSec - startCritSec).count();
std::cout << "ms \n\r";
}
int _tmain(int argc, _TCHAR* argv[]) {
InitializeCriticalSection(&g_critSec);
std::cout << "Tasks: " << tastCount << "\n\r";
for (numThreads = 1; numThreads <= MAX_THREADS; numThreads = numThreads * 2) {
if (numThreads == 16)
numThreads = 12;
Sleep(100);
std::cout << "Thread count: " << numThreads << "\n\r";
testRound();
}
DeleteCriticalSection(&g_critSec);
return 0;
}
Solution 3:
I was searching here for pthread vs. critical section benchmarks; however, as my results turned out to be different from waldez's answer, I thought it would be interesting to share them.
The code is the one used by @waldez, modified to add pthreads to the comparison, compiled with GCC without optimizations. My CPU is an AMD A8-3530MX.
Windows 7 Home Edition:
>a.exe
Iterations: 1000000
Thread count: 1
std::mutex: 46800us
CRITICAL_SECTION: 31200us
pthreads: 31200us
Thread count: 2
std::mutex: 171600us
CRITICAL_SECTION: 218400us
pthreads: 124800us
Thread count: 4
std::mutex: 327600us
CRITICAL_SECTION: 374400us
pthreads: 249600us
Thread count: 8
std::mutex: 967201us
CRITICAL_SECTION: 748801us
pthreads: 717601us
Thread count: 16
std::mutex: 2745604us
CRITICAL_SECTION: 1497602us
pthreads: 1903203us
As you can see, the difference stays well within statistical error: sometimes std::mutex is faster, sometimes it's not. What's important is that I do not observe as big a difference as the original answer.
I think the reason may be that when that answer was posted, the MSVC compiler wasn't good with the newer standards; note that the original answer used the 2012 version.
Also, out of curiosity, here is the same binary under Wine on Arch Linux:
$ wine a.exe
fixme:winediag:start_process Wine Staging 2.19 is a testing version containing experimental patches.
fixme:winediag:start_process Please mention your exact version when filing bug reports on winehq.org.
Iterations: 1000000
Thread count: 1
std::mutex: 53810us
CRITICAL_SECTION: 95165us
pthreads: 62316us
Thread count: 2
std::mutex: 604418us
CRITICAL_SECTION: 1192601us
pthreads: 688960us
Thread count: 4
std::mutex: 779817us
CRITICAL_SECTION: 2476287us
pthreads: 818022us
Thread count: 8
std::mutex: 1806607us
CRITICAL_SECTION: 7246986us
pthreads: 809566us
Thread count: 16
std::mutex: 2987472us
CRITICAL_SECTION: 14740350us
pthreads: 1453991us
waldez's code with my modifications:
#include <math.h>
#include <stdio.h>   // getchar
#include <windows.h>
#include <mutex>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>
#include <pthread.h>

const int g_cRepeatCount = 1000000;
const int g_cThreadCount = 16;

double g_shmem = 8;
std::mutex g_mutex;
CRITICAL_SECTION g_critSec;
pthread_mutex_t pt_mutex;

void sharedFunc( int i )
{
    if ( i % 2 == 0 )
        g_shmem = sqrt(g_shmem);
    else
        g_shmem *= g_shmem;
}

void threadFuncCritSec() {
    for ( int i = 0; i < g_cRepeatCount; ++i ) {
        EnterCriticalSection( &g_critSec );
        sharedFunc(i);
        LeaveCriticalSection( &g_critSec );
    }
}

void threadFuncMutex() {
    for ( int i = 0; i < g_cRepeatCount; ++i ) {
        g_mutex.lock();
        sharedFunc(i);
        g_mutex.unlock();
    }
}

void threadFuncPTMutex() {
    for ( int i = 0; i < g_cRepeatCount; ++i ) {
        pthread_mutex_lock(&pt_mutex);
        sharedFunc(i);
        pthread_mutex_unlock(&pt_mutex);
    }
}

void testRound(int threadCount)
{
    std::vector<std::thread> threads;

    auto startMutex = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threadCount; ++i)
        threads.push_back(std::thread( threadFuncMutex ));
    for ( std::thread& thd : threads )
        thd.join();
    auto endMutex = std::chrono::high_resolution_clock::now();
    std::cout << "std::mutex: ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endMutex - startMutex).count();
    std::cout << "us \n";
    g_shmem = 0;    // reset the shared variable between measurements

    threads.clear();
    auto startCritSec = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threadCount; ++i)
        threads.push_back(std::thread( threadFuncCritSec ));
    for ( std::thread& thd : threads )
        thd.join();
    auto endCritSec = std::chrono::high_resolution_clock::now();
    std::cout << "CRITICAL_SECTION: ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endCritSec - startCritSec).count();
    std::cout << "us \n";
    g_shmem = 0;

    threads.clear();
    auto startPThread = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threadCount; ++i)
        threads.push_back(std::thread( threadFuncPTMutex ));
    for ( std::thread& thd : threads )
        thd.join();
    auto endPThread = std::chrono::high_resolution_clock::now();
    std::cout << "pthreads: ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endPThread - startPThread).count();
    std::cout << "us \n";
    g_shmem = 0;
}

int main() {
    InitializeCriticalSection( &g_critSec );
    pthread_mutex_init(&pt_mutex, 0);

    std::cout << "Iterations: " << g_cRepeatCount << "\n";
    for (int i = 1; i <= g_cThreadCount; i = i*2) {
        std::cout << "Thread count: " << i << "\n";
        testRound(i);
        Sleep(1000);
    }

    getchar();
    DeleteCriticalSection( &g_critSec );
    pthread_mutex_destroy(&pt_mutex);
    return 0;
}
Solution 4:
I'm using Visual Studio 2013.
My results for single-threaded usage look similar to waldez's results (1 million lock/unlock calls):
CRITICAL_SECTION: 19 ms
std::mutex: 48 ms
std::recursive_mutex: 48 ms
The reason why Microsoft changed the implementation is C++11 compatibility. C++11 has four kinds of mutexes in the std namespace (a sketch of the interface differences follows the list):
- std::mutex
- std::timed_mutex
- std::recursive_mutex
- std::recursive_timed_mutex
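The timed variants add try_lock_for/try_lock_until on top of the plain lock/unlock interface, and the recursive variants let the owning thread lock again; that is presumably what the type field in the wrapper below distinguishes. A minimal sketch of the timed interface (the 10 ms timeout is arbitrary, my illustration):
#include <chrono>
#include <mutex>
std::timed_mutex g_timedMutex;
bool tryWork()
{
    // A timed mutex can give up instead of blocking indefinitely.
    if (g_timedMutex.try_lock_for(std::chrono::milliseconds(10))) {
        // ... critical section ...
        g_timedMutex.unlock();
        return true;
    }
    return false;   // could not acquire the lock within 10 ms
}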
Microsoft's std::mutex and all the other mutexes are wrappers around a critical section:
struct _Mtx_internal_imp_t
{   /* Win32 mutex */
    int type;    // here MS keeps particular mutex type
    Concurrency::critical_section cs;
    long thread_id;
    int count;
};
As for me, std::recursive_mutex should completely match a critical section, so Microsoft could optimize its implementation to take less CPU and memory.
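For what it's worth, the property they share is re-entrancy: a CRITICAL_SECTION may be entered again by the thread that already owns it, and std::recursive_mutex allows the same, whereas re-locking a plain std::mutex from the owning thread is undefined behavior. A minimal sketch (my illustration):
#include <mutex>
std::recursive_mutex g_recMutex;
void inner()
{
    // Re-locking from the owning thread is allowed for a recursive mutex,
    // just as EnterCriticalSection may be called again by the owner.
    std::lock_guard<std::recursive_mutex> lock(g_recMutex);
    // ... access shared state ...
}
void outer()
{
    std::lock_guard<std::recursive_mutex> lock(g_recMutex);
    inner();   // would be undefined behavior with a plain std::mutex
}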