Why are there so many OpenBLAS packages and which one would yield fastest results?
In Ubuntu 20.04, there are many packages for OpenBLAS.
~$ apt search openblas
p libopenblas-base - Optimized BLAS (linear algebra) library (transitional)
p libopenblas-dev - Optimized BLAS (linear algebra) library (dev, meta)
p libopenblas-openmp-dev - Optimized BLAS (linear algebra) library (dev, openmp)
p libopenblas-pthread-dev - Optimized BLAS (linear algebra) library (dev, pthread)
p libopenblas-serial-dev - Optimized BLAS (linear algebra) library (dev, serial)
i A libopenblas0 - Optimized BLAS (linear algebra) library (meta)
p libopenblas0-openmp - Optimized BLAS (linear algebra) library (shared lib, openmp)
i A libopenblas0-pthread - Optimized BLAS (linear algebra) library (shared lib, pthread)
p libopenblas0-serial - Optimized BLAS (linear algebra) library (shared lib, serial)
p libopenblas64-0 - Optimized BLAS (linear algebra) library (shared lib, 64bit, meta)
p libopenblas64-0-openmp - Optimized BLAS (linear algebra) library (shared lib, 64bit, openmp)
p libopenblas64-0-pthread - Optimized BLAS (linear algebra) library (shared lib, 64bit, pthread)
p libopenblas64-0-serial - Optimized BLAS (linear algebra) library (shared lib, 64bit, serial)
p libopenblas64-dev - Optimized BLAS (linear algebra) library (dev, 64bit, meta)
p libopenblas64-openmp-dev - Optimized BLAS (linear algebra) library (dev, 64bit, openmp)
p libopenblas64-pthread-dev - Optimized BLAS (linear algebra) library (dev, 64bit, pthread)
p libopenblas64-serial-dev - Optimized BLAS (linear algebra) library (dev, 64bit, serial)
Which of these packages would yield fastest results?
I intend to do numerical computation (mostly diagonalizing matrices) in GNU Octave. My computer has an Intel Core i3-5005U processor (if the optimal package depends on the processor type, please also mention which package is preferred on other processors).
I have noticed at least a 10x speed improvement when OpenBLAS is used instead of the default BLAS.
Solution 1:
The answer to the "why?" question is probably: to provide a universal solution for many CPUs and platforms.
Technically, all these binary packages are built from the same openblas source package.
As for the library variants offered through update-alternatives: after sudo apt-get install "*openblas*" we can count 4 groups with 4 choices:
$ sudo update-alternatives --config libopenblas<Tab>
libopenblas64.so.0-x86_64-linux-gnu  libopenblas64.so-x86_64-linux-gnu
libopenblas.so.0-x86_64-linux-gnu    libopenblas.so-x86_64-linux-gnu
After installation, the pthread variant is set as the default (selection 0).
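To double-check which shared object the alternatives system actually resolves at run time, a small C program can ask dladdr() where a BLAS symbol came from. This is only a minimal sketch; it assumes a development symlink such as libblas.so is available for linking (e.g. from libopenblas-dev or libblas-dev):

```c
/* check_blas.c - print which BLAS shared object provides dgemm_.
 * Build (assumption: a BLAS dev package provides the libblas.so symlink):
 *   gcc check_blas.c -o check_blas -lblas -ldl
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

/* Fortran BLAS symbol; we only need its address, not a call. */
extern void dgemm_(void);

int main(void)
{
    Dl_info info;

    if (dladdr((void *)dgemm_, &info) && info.dli_fname)
        printf("dgemm_ resolved from: %s\n", info.dli_fname);
    else
        printf("could not resolve dgemm_\n");

    return 0;
}
```

After switching with update-alternatives --config, the printed path should change accordingly (e.g. to something under /usr/lib/x86_64-linux-gnu/openblas-openmp/).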
For basic benchmarking we can use our old mkl-test.sh script, switching between the different openblas alternatives with update-alternatives.
Below are results for my i7-3537U, taken from the third run (lower is better; all results are in seconds):
| Alt library | Scilab | Julia | Python 3 with NumPy | R | Octave |
|---|---|---|---|---|---|
| pthread | 0.31 | 0.76 | 0.31 | 0.39 | 0.31 |
| openmp | 0.24 | 0.75 | 0.22 | 0.31 | 0.22 |
| serial | 0.17 | 0.79 | 0.17 | 0.27 | 0.17 |
| atlas/liblapack | 0.31 | 0.75 | 0.32 | 0.52 | 0.32 |
| lapack/liblapack | 0.26 | 0.76 | 0.30 | 0.47 | 0.28 |
| libmkl_rt (MKL) | 0.16 | 0.76 | 0.16 | 0.22 | 0.16 |
A better way would be to run the official benchmarks, but I haven't yet figured out how to run them.
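In the meantime, a single-file C microbenchmark that times cblas_dgemm is a rough substitute for comparing the variants after switching alternatives. This is a minimal sketch; the matrix size and the -lopenblas link line are assumptions, adjust them for your setup:

```c
/* dgemm_bench.c - time a double-precision matrix multiply through CBLAS.
 * Build (assumption: libopenblas-dev or another CBLAS provider is installed):
 *   gcc -O2 dgemm_bench.c -o dgemm_bench -lopenblas
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

#define N 2000          /* matrix dimension; adjust to taste */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double *a = malloc(sizeof(double) * N * N);
    double *b = malloc(sizeof(double) * N * N);
    double *c = malloc(sizeof(double) * N * N);
    if (!a || !b || !c)
        return 1;

    for (size_t i = 0; i < (size_t)N * N; i++) {
        a[i] = (double)rand() / RAND_MAX;
        b[i] = (double)rand() / RAND_MAX;
        c[i] = 0.0;
    }

    double t0 = now_sec();
    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, a, N, b, N, 0.0, c, N);
    double t1 = now_sec();

    printf("dgemm %dx%d: %.3f s\n", N, N, t1 - t0);

    free(a); free(b); free(c);
    return 0;
}
```

Run it once per alternative (pthread, openmp, serial, and so on) and compare; do a warm-up run or two first, since the numbers above were taken from the third run.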
Solution 2:
Why are there so many?
- The -dev packages include the headers and libraries needed for linking against OpenBLAS.
- The libopenblas0* libraries take 32-bit (i.e., uint32_t) integer arguments.
- The libopenblas64-0* libraries take 64-bit (i.e., uint64_t) integer arguments. For example, in zgetrf(), if your pivot array is uint32_t *ip, it is not suitable for the 64-bit version, because the elements are spaced 4 bytes apart (uint32_t) instead of 8 bytes. You would have to refactor your code to use 8-byte array elements (uint64_t); a short sketch of the difference follows this list.
- If you use the 64-bit library while your code is set up for 32-bit integers, it will crash by going out of bounds.
- If you use the 32-bit library while your code is set up for 64-bit integers, it might not go out of bounds, but you won't get a correct result.
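As an illustration of that integer-width difference, here is a minimal C sketch that factors a small complex matrix with the Fortran-style zgetrf_ symbol. Only the blas_int typedef has to change between the 32-bit and 64-bit interface packages; the BLAS64_INTERFACE macro name and the link line are assumptions made for this sketch, and the 64-bit Debian builds may name their symbols differently, so check the package you actually link against:

```c
/* zgetrf_width.c - LU-factor a 2x2 complex matrix via the Fortran symbol
 * zgetrf_, with the integer width selected at compile time.
 *
 * Build against the usual 32-bit-integer interface (assumption):
 *   gcc zgetrf_width.c -o zgetrf_width -lopenblas
 * For an ILP64 build you would define BLAS64_INTERFACE (a name chosen here,
 * not an official macro) and link the 64-bit-integer library instead.
 */
#include <stdio.h>
#include <stdint.h>
#include <complex.h>

#ifdef BLAS64_INTERFACE
typedef int64_t blas_int;   /* libopenblas64-* packages: 64-bit integers */
#else
typedef int32_t blas_int;   /* libopenblas0-*  packages: 32-bit integers */
#endif

/* Fortran LAPACK routine: every argument is passed by pointer. */
extern void zgetrf_(const blas_int *m, const blas_int *n,
                    double _Complex *a, const blas_int *lda,
                    blas_int *ipiv, blas_int *info);

int main(void)
{
    /* Column-major 2x2 matrix. */
    double _Complex a[4] = { 2.0 + 0.0*I, 1.0 + 1.0*I,
                             1.0 - 1.0*I, 3.0 + 0.0*I };
    blas_int n = 2, lda = 2, info = 0;
    blas_int ipiv[2];       /* pivot array must match the library's width */

    zgetrf_(&n, &n, a, &lda, ipiv, &info);
    printf("zgetrf info = %ld (0 means success)\n", (long)info);
    return 0;
}
```

If ipiv were declared with the wrong width for the library you link, zgetrf_ would write pivots past (or short of) the array, which is exactly the out-of-bounds or incorrect-result behaviour described above.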
Benchmarks
I'm integrating LAPACK, ATLAS, and OpenBLAS support into the xnec2c EM simulator and have the following benchmarks, specifically for zgetrf() and zgetrs() used in place of the existing NEC2 Gaussian elimination algorithm. Be sure to run your own benchmarks for your application; results will vary.
In Ubuntu 20.04 we selected the libraries as follows:
~# update-alternatives --config libblas.so.3-x86_64-linux-gnu
~# update-alternatives --config liblapack.so.3-x86_64-linux-gnu
In CentOS/RHEL you have the following libraries and can access them directly without needing to use alternatives:
- libopenblas.so: serial
- libopenblaso.so: OpenMP
- libopenblasp.so: pthreads
Here are the numbers copy-pasted from my notepad while testing Ubuntu 20.04. The first two lines of each section show which libraries were selected for that test. These tests were run on a CentOS 7 VM under KVM with 24 vCPUs, hosted on a dual-socket Xeon E5-2450 v2 server with 32 logical processors (16 cores, 32 threads). The hypervisor was lightly loaded and shouldn't interfere with the numbers too much, but there is bound to be some jitter.
Default LAPACK+BLAS:
/usr/lib/x86_64-linux-gnu/blas/libblas.so.3
/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3
Serial:
Frequency_Loop elapsed time: 17.540371 seconds
Frequency_Loop elapsed time: 16.697984 seconds
Frequency_Loop elapsed time: 15.621345 seconds
Frequency_Loop elapsed time: 15.515307 seconds
Using ATLAS
/usr/lib/x86_64-linux-gnu/atlas/libblas.so.3
/usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3
Serial:
Frequency_Loop elapsed time: 12.882587 seconds
Frequency_Loop elapsed time: 12.233791 seconds
Frequency_Loop elapsed time: 12.828287 seconds
Frequency_Loop elapsed time: 12.607457 seconds
Mixing ATLAS's BLAS with OpenBLAS serial
/usr/lib/x86_64-linux-gnu/atlas/libblas.so.3
/usr/lib/x86_64-linux-gnu/openblas-serial/liblapack.so.3
Serial:
Frequency_Loop elapsed time: 11.757070 seconds
Frequency_Loop elapsed time: 11.566754 seconds
OpenBLAS Serial
/usr/lib/x86_64-linux-gnu/openblas-serial/libblas.so.3
/usr/lib/x86_64-linux-gnu/openblas-serial/liblapack.so.3
Serial:
Frequency_Loop elapsed time: 11.345475 seconds
Frequency_Loop elapsed time: 12.047305 seconds
Frequency_Loop elapsed time: 11.693541 seconds
OpenBLAS Serial LAPACK with OpenMP BLAS
/usr/lib/x86_64-linux-gnu/openblas-openmp/libblas.so.3
/usr/lib/x86_64-linux-gnu/openblas-serial/liblapack.so.3
Serial (or barely threaded, 101%)
Frequency_Loop elapsed time: 11.049351 seconds
Frequency_Loop elapsed time: 11.756581 seconds
OpenBLAS OpenMP
/usr/lib/x86_64-linux-gnu/openblas-openmp/libblas.so.3
/usr/lib/x86_64-linux-gnu/openblas-openmp/liblapack.so.3
Threaded (~400% cpu)
Frequency_Loop elapsed time: 8.079269 seconds
Frequency_Loop elapsed time: 8.119229 seconds
Frequency_Loop elapsed time: 8.329753 seconds
OpenBLAS OpenMP LAPACK with default BLAS
/usr/lib/x86_64-linux-gnu/blas/libblas.so.3
/usr/lib/x86_64-linux-gnu/openblas-openmp/liblapack.so.3
Frequency_Loop elapsed time: 8.161807 seconds
Frequency_Loop elapsed time: 8.009399 seconds
OpenBLAS pthreads with different thread limits.
Note that limiting the threads reduces thread contention and increases performance in some cases:
/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
/usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
Threaded (~1400% cpu)
Frequency_Loop elapsed time: 13.181950 seconds
Frequency_Loop elapsed time: 12.866588 seconds
OPENBLAS_NUM_THREADS=4
Frequency_Loop elapsed time: 9.567861 seconds
OPENBLAS_NUM_THREADS=8
Frequency_Loop elapsed time: 8.767348 seconds
OPENBLAS_NUM_THREADS=16
Frequency_Loop elapsed time: 9.818271 seconds
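Besides the OPENBLAS_NUM_THREADS environment variable, OpenBLAS exposes a small runtime API for the same thing, which is handy when trying several thread counts from one process. This is a minimal sketch; it assumes you are linked against an OpenBLAS build, since the openblas_* extension symbols don't exist in the reference BLAS:

```c
/* thread_limit.c - query and limit OpenBLAS threading at run time.
 * Build (assumption: linked against OpenBLAS, e.g. libopenblas-dev):
 *   gcc thread_limit.c -o thread_limit -lopenblas
 */
#include <stdio.h>

/* OpenBLAS extensions (declared in OpenBLAS's cblas.h; declared by hand
 * here so the sketch doesn't depend on the header's install path). */
extern void  openblas_set_num_threads(int num_threads);
extern int   openblas_get_num_threads(void);
extern char *openblas_get_config(void);

int main(void)
{
    printf("OpenBLAS config : %s\n", openblas_get_config());
    printf("default threads : %d\n", openblas_get_num_threads());

    /* Equivalent to OPENBLAS_NUM_THREADS=4, but switchable per run. */
    openblas_set_num_threads(4);
    printf("now limited to  : %d\n", openblas_get_num_threads());
    return 0;
}
```

This mirrors the OPENBLAS_NUM_THREADS experiments above without restarting the process between thread counts.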
Results
OpenBLAS's OpenMP variant appears to perform best for this example. However, if your code is well partitioned, you might benefit from fork()ing or using pthreads to run jobs completely in parallel. Some balance of forked jobs and OPENBLAS_NUM_THREADS for per-job parallelism may work well, but combining forking with OpenBLAS's or ATLAS's own threading will cause contention.
For example, xnec2c supports a -jNN option for the number of frequency jobs to run in parallel. With many frequencies it is often fastest to run the serial LAPACK versions in completely fork()ed parallelism (one job per frequency), rather than fewer forked jobs with more OpenBLAS/ATLAS threads, because matrix operations often have a mostly-serial reduction phase that cannot be parallelized. (See Amdahl's Law.)
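A bare-bones sketch of that forked, one-process-per-frequency pattern is below. solve_one_frequency() is a hypothetical placeholder for the per-frequency work (it is not an xnec2c function), and each child is meant to run single-threaded BLAS (the serial variant, or OPENBLAS_NUM_THREADS=1) so the forked jobs don't fight each other for cores:

```c
/* forked_jobs.c - run one independent BLAS/LAPACK job per child process.
 * A sketch only: solve_one_frequency() is a made-up stand-in for the real
 * per-frequency computation, not part of xnec2c or any library.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NUM_FREQS 8

static void solve_one_frequency(int idx)
{
    /* Hypothetical: fill the matrix for frequency idx, then factor and
     * solve it with zgetrf_/zgetrs_ (or your routine of choice). */
    printf("child %d: solving frequency %d\n", (int)getpid(), idx);
}

int main(void)
{
    /* Keep each job's BLAS serial so the forked children don't contend;
     * in practice you may prefer to export these before launching the
     * program, since the BLAS library may read them at load time. */
    setenv("OPENBLAS_NUM_THREADS", "1", 1);
    setenv("OMP_NUM_THREADS", "1", 1);

    for (int i = 0; i < NUM_FREQS; i++) {
        pid_t pid = fork();
        if (pid == 0) {            /* child: do one frequency and exit */
            solve_one_frequency(i);
            fflush(stdout);
            _exit(0);
        } else if (pid < 0) {
            perror("fork");
            return 1;
        }
    }

    /* Parent waits for all children to finish. */
    while (wait(NULL) > 0)
        ;
    return 0;
}
```

With one process per frequency there is no shared BLAS thread pool to contend for, which is why the serial variants tend to win in this mode.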
Side comment: ATLAS will auto-tune for your CPU if you recompile it for your host. OpenBLAS might do some of that too, not sure.