Efficient outer product in python

The outer product in Python seems quite slow when dealing with vectors whose dimension is on the order of 10k. Could someone give me an idea of how to speed up this operation in Python?

Code is as follows:

 In [8]: a.shape
 Out[8]: (128,)

 In [9]: b.shape
 Out[9]: (32000,)

 In [10]: %timeit np.outer(b,a)
 100 loops, best of 3: 15.4 ms per loop

Since I have to do this operation many times, it dominates my code's running time.


Solution 1:

It doesn't really get much faster than that; these are your options:

numpy.outer

>>> %timeit np.outer(a,b)
100 loops, best of 3: 9.79 ms per loop

numpy.einsum

>>> %timeit np.einsum('i,j->ij', a, b)
100 loops, best of 3: 16.6 ms per loop
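As a side note (not part of the timings above), plain NumPy broadcasting computes the same array as np.outer and is a handy sanity check; the shapes below match the question's:

```python
import numpy as np

a = np.random.rand(128)
b = np.random.rand(32000)

# a[:, None] has shape (128, 1); broadcasting against b (32000,)
# yields (128, 32000), elementwise a[i] * b[j] -- same as np.outer(a, b)
outer_bc = a[:, None] * b
assert np.allclose(outer_bc, np.outer(a, b))
```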

numba

import numpy as np
from numba.decorators import autojit  # numba 0.16; modern numba uses `from numba import njit`

@autojit
def outer_numba(a, b):
    m = a.shape[0]
    n = b.shape[0]
    # np.float is deprecated in recent NumPy; use np.float64
    result = np.empty((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i] * b[j]
    return result

>>> %timeit outer_numba(a,b)
100 loops, best of 3: 9.77 ms per loop

parakeet

from parakeet import jit

@jit
def outer_parakeet(a, b):
   ... same as numba

>>> %timeit outer_parakeet(a, b)
100 loops, best of 3: 11.6 ms per loop

cython

cimport numpy as np
import numpy as np
cimport cython
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def outer_cython(np.ndarray[DTYPE_t, ndim=1] a, np.ndarray[DTYPE_t, ndim=1] b):
    cdef int m = a.shape[0]
    cdef int n = b.shape[0]
    cdef int i, j  # typed loop indices keep the loops at C speed
    cdef np.ndarray[DTYPE_t, ndim=2] result = np.empty((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i] * b[j]
    return result

>>> %timeit outer_cython(a, b)
100 loops, best of 3: 10.1 ms per loop

theano

from theano import tensor as T
from theano import function

x = T.vector()
y = T.vector()

outer_theano = function([x, y], T.outer(x, y))

>>> %timeit outer_theano(a, b)
100 loops, best of 3: 17.4 ms per loop

pypy

# Same code as the `outer_numba` function
>>> setup = ("import numpy as np;"
...          "a = np.random.rand(128,);"
...          "b = np.random.rand(32000,);"
...          "from test import outer_pypy;"
...          "outer_pypy(a,b)")
>>> timeit.timeit("outer_pypy(a,b)", number=100, setup=setup) * 1000 / 100.0
16.36  # ms

Conclusions:

╔═══════════╦═══════════╦═════════╗
║  method   ║ time(ms)* ║ version ║
╠═══════════╬═══════════╬═════════╣
║ numba     ║ 9.77      ║ 0.16.0  ║
║ np.outer  ║ 9.79      ║ 1.9.1   ║
║ cython    ║ 10.1      ║ 0.21.2  ║
║ parakeet  ║ 11.6      ║ 0.23.2  ║
║ pypy      ║ 16.36     ║ 2.4.0   ║
║ np.einsum ║ 16.6      ║ 1.9.1   ║
║ theano    ║ 17.4      ║ 0.6.0   ║
╚═══════════╩═══════════╩═════════╝
* less time = faster
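All the methods in the table compute the same array; a quick check of the two pure-NumPy spellings (the compiled versions implement the identical elementwise formula):

```python
import numpy as np

a = np.random.rand(128)
b = np.random.rand(32000)

# Both compute result[i, j] = a[i] * b[j]
out_outer = np.outer(a, b)
out_einsum = np.einsum('i,j->ij', a, b)
assert np.allclose(out_outer, out_einsum)
```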

Solution 2:

@elyase's answer is great, and rightly accepted. Here's an additional suggestion that, if you can use it, might make the call to np.outer even faster.

You say "I have to do this operation several times", so it is possible that you can reuse the array that holds the outer product, instead of allocating a new one each time. That can give a nice boost in performance.

First, some random data to work with:

In [32]: a = np.random.randn(128)

In [33]: b = np.random.randn(32000)

Here's the baseline timing for np.outer(a, b) on my computer:

In [34]: %timeit np.outer(a, b)
100 loops, best of 3: 5.52 ms per loop

Suppose we're going to repeat that operation several times, with arrays of the same shape. Create an out array to hold the result:

In [35]: out = np.empty((128, 32000))

Now use out as the third argument of np.outer:

In [36]: %timeit np.outer(a, b, out)
100 loops, best of 3: 2.38 ms per loop

So you get a nice performance boost if you can reuse the array that holds the outer product.

You get a similar benefit if you use the out argument of einsum, and in the cython function if you add a third argument for the output instead of allocating it in the function with np.empty. (The other compiled/jitted codes in @elyase's answer will probably benefit from this, too, but I only tried the cython version.)
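A sketch of the reuse pattern for both np.outer and np.einsum (the "several times" from the question is modeled as a small loop; `out` must already have the right shape and dtype):

```python
import numpy as np

a = np.random.rand(128)
b = np.random.rand(32000)

# Allocate the result array once, outside the loop
out = np.empty((a.size, b.size))

for _ in range(5):
    np.outer(a, b, out=out)   # writes into out; no new allocation
    # ... use out here ...

# einsum accepts the same out= keyword
np.einsum('i,j->ij', a, b, out=out)
assert np.allclose(out, np.outer(a, b))
```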

Nota bene! The benefit shown above might not be fully realized in practice. The out array fits in the L3 cache of my CPU, and when it is used in the loop run by the timeit command, it likely stays in the cache. In practice, the array might be evicted from the cache between calls to np.outer. In that case the improvement won't be as dramatic, but you should still save at least the cost of a call to np.empty(), i.e.

In [53]: %timeit np.empty((128, 32000))
1000 loops, best of 3: 1.29 ms per loop

Solution 3:

It should be as simple as using numpy.outer(): a single function call, implemented in C for high performance.
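For reference, a minimal example (small shapes chosen for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0])

# np.outer computes result[i, j] = a[i] * b[j];
# the result has shape (len(a), len(b)) == (3, 2)
result = np.outer(a, b)
print(result)
# [[10. 20.]
#  [20. 40.]
#  [30. 60.]]
```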