Why does having more and faster cores make my multithreaded software slower?

Solution 1:

This sure looks like a NUMA effect: drastic performance degradation once multiple sockets are involved is the classic symptom.
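A quick way to test that hypothesis is to confine the process to a single socket and compare. A minimal sketch, assuming numactl is installed and using ./your_app as a placeholder for your binary:

numactl --hardware                                 # show NUMA nodes and which CPUs/memory belong to each
numactl --cpunodebind=0 --membind=0 ./your_app     # run pinned to node 0's CPUs and memory

If throughput recovers when everything runs on one node, cross-socket cache-line and lock traffic is the prime suspect.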

perf is very useful. Already in the perf report you can see native_queued_spin_lock_slowpath taking 35%, which is a very large amount of overhead for your concurrency code. The tricky part is visualizing what is calling what if you don't know the concurrency code extremely well.
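As a first step, you can ask perf report to fold the recorded call chains caller-first, so the lock slowpath gets grouped under whatever is invoking it. A sketch, assuming the profile was recorded with call graphs (perf record -g):

perf report -g graph,0.5,caller    # fold call chains caller-first; hide entries below 0.5%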

I would recommend making flame graphs from system-wide CPU sampling. Quick start:

git clone https://github.com/brendangregg/FlameGraph  # or download it from github
cd FlameGraph
perf record -F 99 -a -g -- sleep 60                       # sample all CPUs (-a) at 99 Hz (-F 99), with call graphs (-g), for 60 s
perf script | ./stackcollapse-perf.pl > out.perf-folded   # collapse each sampled stack into one line
./flamegraph.pl out.perf-folded > perf-kernel.svg         # render the folded stacks as an interactive SVG flame graph

In the resulting graphic, look for the widest "plateaus" along the top edge: those indicate the functions with the most exclusive time.

I look forward to the bpfcc-tools package landing in Debian stable; it will enable collection of these "folded" stacks directly, with less overhead.
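For reference, this is roughly what that looks like with bcc's profile tool; a sketch based on bcc's documentation, so the install path and flags may differ per distro:

/usr/share/bcc/tools/profile -F 99 -adf 60 > out.profile-folded   # sample at 99 Hz, annotate kernel frames, folded output, 60 s
./flamegraph.pl out.profile-folded > profile-kernel.svg           # render with the same FlameGraph script as above

Because the stacks are aggregated in the kernel and emitted already folded, there is no large perf.data file to post-process.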

What you do with this depends on what you find. Know which critical section is being protected by the contended lock, and compare against existing research into scalable synchronization on modern hardware. For example, a Concurrency Kit presentation notes that different spinlock implementations have different properties.

Solution 2:

I would dare say this is a hardware "issue". You are overloading the IO subsystem, and it is the kind of subsystem where more parallelism makes things slower (like spinning disks).

The main indications are:

  • ~100 threads doing IO
  • You say nothing about IO. That is an area inexperienced people typically overlook and never talk about. It is typical for databases: "I have that much RAM, but I don't mention that I run from a slow, large-capacity disk; why am I slow?" A couple of commands to check this are sketched below.
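A minimal way to confirm or rule this out, assuming the sysstat package is installed:

iostat -x 1      # per-device stats; sustained %util near 100 and high await mean a saturated disk
pidstat -d 1     # per-process read/write throughput, to see who is generating the IO

If the disk is pegged, adding threads or cores only increases queueing, which matches the slowdown you are seeing.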

Solution 3:

Because software manufacturers are mostly too lazy to optimize for multiple cores.

Software designers rarely design software that can use the full hardware capabilities of a system. One example of very well written software is coin-mining software, since many miners are able to use the video card's processing power near its maximum level (unlike games, which never get close to utilizing the true processing power of a GPU).

A similar thing is true for quite a lot of software nowadays. Many programs never receive multi-core optimizations, so they perform better with fewer cores set at a higher speed than with more, slower cores. More and faster cores cannot always be an advantage either, for the same reason: poorly written code. The program may try to split its sub-tasks across too many cores, and the coordination overhead actually delays overall processing. A simple way to test this is sketched below.
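If you suspect this, one cheap experiment is to restrict the program to a subset of cores and compare throughput. A sketch, assuming Linux and using ./your_app as a placeholder for your binary:

nproc                       # show how many cores the system exposes in total
taskset -c 0-3 ./your_app   # run on cores 0-3 only; if this beats running on all cores, over-parallelism is likely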