Why is numpy's einsum faster than numpy's built in functions?

First off, there's been a lot of past discussion about this on the numpy list. For example, see: http://numpy-discussion.10968.n7.nabble.com/poor-performance-of-sum-with-sub-machine-word-integer-types-td41.html http://numpy-discussion.10968.n7.nabble.com/odd-performance-of-sum-td3332.html

Some of boils down to the fact that einsum is new, and is presumably trying to be better about cache alignment and other memory access issues, while many of the older numpy functions focus on a easily portable implementation over a heavily optimized one. I'm just speculating, there, though.


However, some of what you're doing isn't quite an "apples-to-apples" comparison.

In addition to what @Jamie already said, sum uses a more appropriate accumulator for arrays

For example, sum is more careful about checking the type of the input and using an appropriate accumulator. For example, consider the following:

In [1]: x = 255 * np.ones(100, dtype=np.uint8)

In [2]: x
Out[2]:
array([255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255], dtype=uint8)

Note that the sum is correct:

In [3]: x.sum()
Out[3]: 25500

While einsum will give the wrong result:

In [4]: np.einsum('i->', x)
Out[4]: 156

But if we use a less limited dtype, we'll still get the result you'd expect:

In [5]: y = 255 * np.ones(100)

In [6]: np.einsum('i->', y)
Out[6]: 25500.0

Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double check that Seberg's comment about SSE2 was valid.

To perform the test a new python 2.7 install was created- numpy 1.7 and 1.8 were compiled with icc using standard options on a AMD opteron core running Ubuntu.

This is the test run both before and after the 1.8 upgrade:

import numpy as np
import timeit

arr_1D=np.arange(5000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)

print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'

Numpy 1.7.1:

Summation test:
0.172988510132
0.0934836149216
----------------------

Power test:
1.93524689674
0.839519000053
----------------------

Outer test:
0.130380821228
0.121401786804
----------------------

Einsum test:
0.979052495956
0.126066613197

Numpy 1.8:

Summation test:
0.116551589966
0.0920487880707
----------------------

Power test:
1.23683619499
0.815982818604
----------------------

Outer test:
0.131808176041
0.127472200394
----------------------

Einsum test:
0.781750011444
0.129271841049

I think this is fairly conclusive that SSE plays a large role in the timing differences, it should be noted that repeating these tests the timings very by only ~0.003s. The remaining difference should be covered in the other answers to this question.