In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?

People in HPC usually say that one thread is not enough to saturate a single memory link, and the same is usually true for network links as well. Here is a quick and dirty OpenMP-enabled memsetter I wrote for you that fills 2 GiB of memory with zeros, twice. Below are the results with GCC 4.7 and different numbers of threads on different architectures (maximum values over several runs are reported):
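
A minimal sketch of such a memsetter follows; the exact chunking, timing, and output format here are my assumptions, not a verbatim reproduction. Each thread memsets its own contiguous chunk, and both the first-touch pass and the rewrite pass are timed.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Zero the buffer in parallel, each thread memset-ing its own contiguous
   chunk, and return the achieved bandwidth in MB/s (1 MB = 2^20 bytes). */
static double fill (char *buf, size_t len)
{
   double t1 = omp_get_wtime();
   #pragma omp parallel
   {
      int tid = omp_get_thread_num();
      int nth = omp_get_num_threads();
      size_t chunk = len / nth;
      size_t off = (size_t)tid * chunk;
      size_t sz  = (tid == nth - 1) ? len - off : chunk;
      memset(buf + off, 0, sz);
   }
   double t2 = omp_get_wtime();
   return len / (t2 - t1) / (1 << 20);
}

int main (void)
{
   size_t len = 2UL << 30;                 /* 2 GiB */
   char *buf = malloc(len);
   if (buf == NULL) return 1;

   /* First pass: the thread team is created and pages are mapped on first touch */
   printf("1st touch: %10.3f MB/s\n", fill(buf, len));
   /* Second pass: pages are already resident, so only the stores are timed */
   printf("rewrite:   %10.3f MB/s\n", fill(buf, len));

   /* Read something back so the compiler cannot discard the stores */
   printf("%d\n", buf[0] + buf[len - 1]);

   free(buf);
   return 0;
}

The number of threads is controlled with OMP_NUM_THREADS; for the "single socket" versus "1 thread per socket" runs below the threads were presumably pinned to the corresponding cores, e.g. via GOMP_CPU_AFFINITY (see the note on pinning further down).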

GCC 4.7, code compiled with -O3 -mtune=native -fopenmp:

Quad-socket Intel Xeon X7350 - pre-Nehalem quad-core CPU with separate memory controller and Front Side Bus

single socket

threads   1st touch      rewrite
1         1452.223 MB/s  3279.745 MB/s
2         1541.130 MB/s  3227.216 MB/s
3         1502.889 MB/s  3215.992 MB/s
4         1468.931 MB/s  3201.481 MB/s

(1st touch is slow as the thread team is being created from scratch and the operating system is mapping physical pages into the virtual address space reserved by malloc(3))

One thread already saturates the memory bandwidth of a single CPU <-> NB (North Bridge) link.

1 thread per socket

threads   1st touch      rewrite
1         1455.603 MB/s  3273.959 MB/s
2         2824.883 MB/s  5346.416 MB/s
3         3979.515 MB/s  5301.140 MB/s
4         4128.784 MB/s  5296.082 MB/s

Two threads are necessary to saturate the full memory bandwidth of the NB <-> memory link.

Octo-socket Intel Xeon X7550 - 8-way NUMA system with octo-core CPUs (CMT disabled)

single socket

threads   1st touch      rewrite
1         1469.897 MB/s  3435.087 MB/s
2         2801.953 MB/s  6527.076 MB/s
3         3805.691 MB/s  9297.412 MB/s
4         4647.067 MB/s  10816.266 MB/s
5         5159.968 MB/s  11220.991 MB/s
6         5330.690 MB/s  11227.760 MB/s

At least 5 threads are necessary in order to saturate the bandwidth of one memory link.

1 thread per socket

threads   1st touch      rewrite
1         1460.012 MB/s  3436.950 MB/s
2         2928.678 MB/s  6866.857 MB/s
3         4408.359 MB/s  10301.129 MB/s
4         5859.548 MB/s  13712.755 MB/s
5         7276.209 MB/s  16940.793 MB/s
6         8760.900 MB/s  20252.937 MB/s

Bandwidth scales almost linearly with the number of threads. Based on the single-socket observations one could say that at least 40 threads distributed as 5 threads per socket would be necessary in order to saturate all of the eight memory links.

The basic problem on NUMA systems is the first-touch memory policy: memory is allocated on the NUMA node where the thread that first touches a virtual address within a given page is executing. Thread pinning (binding threads to specific CPU cores) is essential on such systems, since thread migration leads to remote access, which is slower. Support for pinning is available in most OpenMP runtimes: GCC's libgomp has the GOMP_CPU_AFFINITY environment variable, Intel has the KMP_AFFINITY environment variable, and so on. Also, OpenMP 4.0 introduced the vendor-neutral concept of places.
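
A minimal sketch of the first-touch idiom that follows from this (the array, its size, and the second loop are purely illustrative): initialise the data in parallel with the same static distribution that the compute loops will later use, and run with the threads pinned.

#include <stdlib.h>

#define N (1UL << 28)   /* illustrative size: 256 Mi doubles = 2 GiB */

int main (void)
{
   double *a = malloc(N * sizeof(double));
   if (a == NULL) return 1;

   /* First touch in parallel: with a static schedule and pinned threads,
      each page is mapped on the NUMA node of the thread that touches it */
   #pragma omp parallel for schedule(static)
   for (size_t i = 0; i < N; i++)
      a[i] = 0.0;

   /* Later compute loops use the same static schedule, so each thread
      mostly accesses memory local to its own node.  Pin the threads,
      e.g. with GOMP_CPU_AFFINITY (libgomp), KMP_AFFINITY (Intel), or
      the OpenMP 4.0 OMP_PLACES / OMP_PROC_BIND variables. */
   #pragma omp parallel for schedule(static)
   for (size_t i = 0; i < N; i++)
      a[i] += 1.0;

   free(a);
   return 0;
}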

Edit: For completeness, here are the results of running the code with a 1 GiB array on a MacBook Air with an Intel Core i5-2557M (dual-core Sandy Bridge CPU with HT and QPI). The compiler is GCC 4.2.1 (Apple LLVM build):

threads   1st touch      rewrite
1         2257.699 MB/s  7659.678 MB/s
2         3282.500 MB/s  8157.528 MB/s
3         4109.371 MB/s  8157.335 MB/s
4         4591.780 MB/s  8141.439 MB/s

Why such high speed even with a single thread? A little exploration with gdb shows that memset(buf, 0, len) gets translated by the OS X compiler to bzero(buf, len), and that an SSE4.2-enabled vectorised version named bzero$VARIANT$sse42 is provided by libc.dylib and used at run time. It uses the MOVDQA instruction to zero 16 bytes of memory at once, which is why the memory bandwidth is almost saturated even with one thread. A single-threaded AVX-enabled version using VMOVDQA could zero 32 bytes at once and probably saturate the memory link.
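
For illustration, here is a rough equivalent of such a vectorised zeroing loop written with SSE2 intrinsics; the real bzero$VARIANT$sse42 is hand-written assembly and more elaborate than this sketch.

#include <emmintrin.h>   /* SSE2 intrinsics; _mm_store_si128 emits MOVDQA */
#include <stddef.h>

/* Rough illustration only: zero a 16-byte-aligned buffer 16 bytes at a
   time using aligned vector stores.  The libc routine is hand-tuned
   assembly and handles alignment and large buffers differently. */
void zero_sse (char *buf, size_t len)
{
   __m128i zero = _mm_setzero_si128();
   size_t i = 0;

   for (; i + 16 <= len; i += 16)
      _mm_store_si128((__m128i *)(buf + i), zero);  /* MOVDQA, 16 bytes at once */

   for (; i < len; i++)   /* scalar tail for the last few bytes */
      buf[i] = 0;
}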

The important message here is that sometimes vectorisation and multithreading are not orthogonal in bringing speed-up to the operation.


Well, there's always the L3 cache...

However, it's very likely that this will be bound by main memory bandwidth already; adding more parallelism is unlikely to improve things.