Synchronizing threads in multithreaded applications

I use the SIESTA DFT package on a CentOS 8 (Core) system with a 16-core, 32-thread Xeon processor and OpenMPI 4.1.1 for all calculations.

  1. Since I have 32 threads, I use 28 of them to do a SIESTA calculation (which consumes a good amount of the memory ~60%) and keep the remaining 4 free.

  2. However, if I start using 2 or 3 of the remaining threads for some other application (which has negligible memory usage) while keeping the SIESTA calculation at 28 threads, the SIESTA calculation slows down by around 50-60%.

  3. I have checked the CPU utilization and I see that one thread remains almost idle when using the system in scenario 2.

Is there a way to diagnose and solve this problem? Does this happen because of some process scheduling error? Can some sort of process binding or job scheduling package be used to improve this?


Solution 1:

CPU utilization as a simple percentage cannot convey the complexity of a CPU with multiple cores, multiple hardware threads, and multiple execution units, plus the memory behind it. Almost certainly the CPU is actually stalled on memory or cache, and the processes that do have their data are fighting over execution units.


This CPU only has 16 cores. Treating it as if it had 32 will at some point degrade performance severely, as you discovered, even with 2-way SMT. Maybe you can get the number of threads to 125% of cores (20), but 175% (28) is pushing it, especially with other things running. Back down the threads.
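
To see how many physical cores there really are before picking a thread count, something like the following may help. It is only a sketch: `count_physical_cores` is a helper name made up for this example, and the mpirun line in the comment assumes OpenMPI's `--bind-to`, `--map-by` and `--report-bindings` options plus a hypothetical input file `input.fdf`.

```shell
# Hypothetical helper: count physical cores from `lscpu -p=CORE,SOCKET` output on stdin.
# SMT siblings share the same (core, socket) pair, so each unique pair is counted once.
count_physical_cores() {
    awk '!/^#/ && NF && !seen[$0]++ { n++ } END { print n + 0 }'
}

# Usage on a real machine (not run here; input.fdf is a placeholder name):
#   cores=$(lscpu -p=CORE,SOCKET | count_physical_cores)
#   mpirun -np "$cores" --bind-to core --map-by core --report-bindings siesta < input.fdf
```

Binding one rank per core keeps two ranks from landing on SMT siblings of the same core. Whether that recovers the lost speed in this case is something to measure, and you may want fewer ranks than cores if other jobs have to run alongside.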

Be sure to calculate useful work done per thread per second. Experiment, changing one variable at a time. Maybe try processors with different cache and core count configurations, if you have access to those.
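
A rough way to put a number on "work per thread per second" is sketched below. The helper name `work_per_thread_sec` is invented here, and what counts as a unit of work (for example, completed SCF steps) is an assumption you would need to choose for your own runs.

```shell
# Hypothetical helper: units of work per thread per second.
# Args: units of work done, elapsed wall-clock seconds, thread count.
work_per_thread_sec() {
    awk -v w="$1" -v s="$2" -v t="$3" 'BEGIN {
        if (s <= 0 || t <= 0) {
            print "seconds and threads must be > 0" > "/dev/stderr"
            exit 1
        }
        # Normalize by both time and thread count so runs are comparable.
        printf "%.4f\n", w / (s * t)
    }'
}
```

With made-up numbers, `work_per_thread_sec 120 600 16` prints 0.0125. Comparing this figure across runs at 16, 20 and 28 threads shows where adding threads stops paying off.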


Measure how stalled you are with performance monitoring counters (PMCs). These are often unavailable in a VM, but worth a try on Linux. From Gregg, which I linked earlier:

perf stat -a -- sleep 10

Theoretical top speed on Xeons is 4 or 5 instructions per cycle (IPC). You won't get that, but an IPC below 1.0 suggests the workload is heavily stalled on memory.
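
If you want IPC as a single number, a sketch like this can compute it from perf's CSV output. It assumes `perf stat -x,` writes the counter value in field 1 and the event name in field 3, which is the layout without `-A` or `-I`. The function name `ipc_from_perf_csv` is made up for this example.

```shell
# Hypothetical helper: compute IPC from `perf stat -x,` CSV lines on stdin.
# Assumed layout: field 1 is the counter value, field 3 is the event name.
ipc_from_perf_csv() {
    awk -F, '
        $3 ~ /^cycles/       { cycles += $1 }
        $3 ~ /^instructions/ { insns  += $1 }
        END {
            if (cycles > 0) {
                printf "%.2f\n", insns / cycles
            } else {
                print "no cycle count found" > "/dev/stderr"
                exit 1
            }
        }'
}

# Usage on a real machine (perf stat writes its counters to stderr):
#   perf stat -a -x, -e cycles,instructions -- sleep 10 2>&1 | ipc_from_perf_csv
```

Run it once with SIESTA alone and once with the extra application running. A clear drop in IPC between the two would point toward contention for cache, memory or execution units.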


Definitely get an understanding of the application's code and hot spots. What functions spend most of the time on CPU? What assembly code gets hit the hardest? Which execution units on your CPU in particular are working the hardest to process these uops?

Flame graphs are nice for visualizing on-CPU functions. You mentioned EL 8, which has packaged flame graph tooling.

yum install perf js-d3-flame-graph
# system wide, 99 Hz, for 60 seconds
perf script flamegraph -a -F 99 sleep 60 

A developer-level understanding of the program is necessary to fully interpret the results. With symbols or source code, perf reports can be annotated in a debugger-like experience.