Why is Boost's matrix multiplication slower than mine?
The slower performance of the uBLAS version can be partly explained by uBLAS's debugging features, as TJD pointed out.
Here's the time taken by the uBLAS version with debugging on:
real 0m19.966s
user 0m19.809s
sys 0m0.112s
Here's the time taken by the uBLAS version with debugging off (compiler flags -DNDEBUG -DBOOST_UBLAS_NDEBUG added):
real 0m7.061s
user 0m6.936s
sys 0m0.096s
So with debugging off, the uBLAS version is almost 3 times faster.
The remaining performance difference can be explained by the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS":
An important design goal of ublas is to be as general as possible.
This generality almost always comes with a cost. In particular, the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately, uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing different methods:
ijk algorithm   prod    axpy_prod   block_prod
1.335           7.061   1.330       1.278
As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and carefully choosing the block size for block_prod (I used 64) can make the difference more pronounced.
See also uBLAS FAQ and Effective uBlas and general code optimization.
I believe your compiler doesn't optimize enough. uBLAS code makes heavy use of templates, and heavily templated code relies on aggressive compiler optimization to perform well. I ran your code through the MS VC 7.1 compiler in release mode for 1000x1000 matrices; it gives me
10.064 s for uBLAS
7.851 s for vector
The difference is still there, but it is by no means overwhelming. uBLAS's core concept is lazy evaluation, so prod(A, B) evaluates results only when needed; e.g., prod(A, B)(10,100) will execute in no time, since only that one element will actually be calculated. As such, there's actually no dedicated algorithm for whole-matrix multiplication that could be optimized (see below). But you could help the library a little; declaring
matrix<int, column_major> B;
will reduce the running time to 4.426 s, which beats your function with one hand tied. This declaration makes memory access more sequential when multiplying matrices, optimizing cache usage.
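To make the two ideas in this answer concrete, here is a minimal standalone sketch that does not use uBLAS at all. Dense, LazyProduct, lazy_prod and evaluate are hypothetical names invented for the illustration, and this is not how uBLAS implements prod internally. It only shows a product object that computes a single element on demand, and why a column-major B makes the inner loop read both operands sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix with a selectable storage order (not a uBLAS type).
struct Dense {
    std::size_t rows, cols;
    bool column_major;
    std::vector<int> data;

    Dense(std::size_t r, std::size_t c, bool col_major)
        : rows(r), cols(c), column_major(col_major), data(r * c, 0) {}

    // Row-major: element (i, j) lives at i * cols + j.
    // Column-major: element (i, j) lives at j * rows + i.
    int& at(std::size_t i, std::size_t j) {
        return data[column_major ? j * rows + i : i * cols + j];
    }
    int at(std::size_t i, std::size_t j) const {
        return data[column_major ? j * rows + i : i * cols + j];
    }
};

// Lazy product: stores references only; constructing it computes nothing.
// The operands must outlive this object.
struct LazyProduct {
    const Dense& a;
    const Dense& b;

    // One element on demand: dot product of row i of a with column j of b.
    // With a row-major and b column-major, both reads walk memory sequentially in k.
    int operator()(std::size_t i, std::size_t j) const {
        int sum = 0;
        for (std::size_t k = 0; k < a.cols; ++k)
            sum += a.at(i, k) * b.at(k, j);
        return sum;
    }
};

LazyProduct lazy_prod(const Dense& a, const Dense& b) {
    assert(a.cols == b.rows);
    return LazyProduct{a, b};
}

// Materializes the whole product by evaluating every element in turn.
Dense evaluate(const LazyProduct& p) {
    Dense c(p.a.rows, p.b.cols, false);
    for (std::size_t i = 0; i < c.rows; ++i)
        for (std::size_t j = 0; j < c.cols; ++j)
            c.at(i, j) = p(i, j);
    return c;
}
```

Asking for `lazy_prod(a, b)(i, j)` costs one dot product, while `evaluate` pays for all of them. When b is stored row-major, the same inner loop jumps b.cols elements between consecutive reads of b, which is the cache-unfriendly pattern the column_major declaration avoids.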
P.S. Having read the uBLAS documentation to the end ;), you should have found out that there's actually a dedicated function to multiply whole matrices at once. Two functions, in fact: axpy_prod and opb_prod. So
opb_prod(A, B, C, true);
even on an unoptimized row_major B matrix executes in 8.091 s and is on par with your vector algorithm.
P.P.S. There are even more optimizations:
C = block_prod<matrix<int>, 1024>(A, B);
executes in 4.4 s, no matter whether B is column_major or row_major.
Consider the description: "The function block_prod is designed for large dense matrices." Choose specific tools for specific tasks!