What parts of a Numpy-heavy function can I accelerate with Cython
Introductory notes: trying to accelerate Python+Numpy code with Cython is a common problem and this question is an attempt to create a canonical question about what types of operation you can accelerate effectively. Although I try to illustrate with a specific example, it is meant as an illustration - please don't focus too much on the fairly meaningless example.
Also, I've contributed enough to Cython that I should declare an affiliation (given that I'm bringing the topic up).
Actual question
Suppose I have a function that tries to do numeric calculations on Numpy arrays. It uses fairly typical operations:
- a loop over array elements that can't easily be vectorized
- calls to Numpy/Scipy functions (in this case `np.sin`)
- mathematical operations on the whole array (`a - b`)
```python
import numpy as np

def some_func(a, b):
    """
    a and b are 1D arrays
    This is intended to be illustrative! Please don't focus on what it
    actually does!
    """
    transformed_a = np.zeros_like(a)
    last = 0
    for n in range(1, a.shape[0]):
        an = a[n]
        if an > 0:
            delta = an - a[n-1]
            transformed_a[n] = delta*last
        else:
            last = np.sin(an)
    return transformed_a * b

a = np.random.randn(100)
b = np.linspace(0, 100, a.shape[0])
print(some_func(a, b))
```
Can I speed this up with Cython, and which parts would I expect to be able to speed up?
Solution 1:
Indexing individual array elements
This is the main type of code where Cython can really help you. Indexing an individual element (e.g. `an = a[n]`) can be a fairly slow operation in Python. This is partly because Python is not a hugely quick language, so running Python code many times within a loop is slow, and partly because, although the array is stored as a tightly-packed array of C floating-point values, the indexing operation needs to return a Python object. Indexing a Numpy array therefore requires a new Python object to be allocated each time.
In Cython you can declare the arrays as typed memoryviews, or as `np.ndarray`. (Typed memoryviews are the more modern approach and you should usually prefer them.) Doing so allows you to access the tightly-packed C array and retrieve the C value directly, without creating a Python object.
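To see the Python-side cost concretely, here is a small pure-Python check (not Cython) showing that every scalar index allocates a fresh wrapper object around the underlying C value:

```python
import numpy as np

a = np.random.randn(100)

# Each scalar index creates a new Python object (np.float64)
# wrapping the underlying C double - this per-element allocation
# is exactly what Cython's typed memoryviews avoid.
x = a[0]
y = a[0]
print(type(x))   # a numpy scalar object, not a bare C double
print(x is y)    # False: two distinct wrapper objects for the same value
```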
The directives `cython.boundscheck` and `cython.wraparound` can give very worthwhile further speed-ups to the indexing (but remember they do remove useful safety features, so think before using them).
vs vectorization
A lot of the time a loop over a Numpy array can be written as a vectorized operation - something that acts on the whole array at once. It is usually a good idea to write Python+Numpy code like this. If you have multiple chained vectorized operations, it is sometimes worthwhile to write them explicitly as a single Cython loop to avoid allocating intermediate arrays.
Alternatively, the little-known Pythran backend for Cython can convert a set of vectorized Numpy operations to optimized C++ code.
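As a pure-NumPy illustration of the intermediate-array point (no Cython involved): a chained expression allocates a temporary per operation, while `out=` arguments reuse one buffer - the same saving a single fused Cython loop takes further:

```python
import numpy as np

a = np.random.randn(1000)
b = np.random.randn(1000)

# Chained expression: allocates a temporary for (a - b), another
# for np.sin(...), and a third for the final product.
r1 = np.sin(a - b) * b

# Reusing one preallocated buffer with out= avoids the temporaries.
buf = np.empty_like(a)
np.subtract(a, b, out=buf)
np.sin(buf, out=buf)
np.multiply(buf, b, out=buf)

assert np.allclose(r1, buf)
```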
Indexing array slices
This isn't a problem in Cython, but it typically isn't something that will get you significant speed-ups on its own.
Calling Numpy functions
e.g. `last = np.sin(an)`
These require a Python call, so Cython typically cannot accelerate them - it has no visibility into the contents of the Numpy function.
However, here the operation is on a single value, not on a Numpy array. In this case we can use `sin` from the C standard library, which will be significantly quicker than a Python function call. You'd do `from libc.math cimport sin` and call `sin` rather than `np.sin`.
Numba is an alternative Python accelerator that has better visibility into Numpy functions and can often optimize them without code changes.
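A rough pure-Python analogue of the scalar-`sin` point above: for a single value, the thin C-library wrapper `math.sin` is already much cheaper than `np.sin` (which is built for whole arrays), and Cython's `libc.math` `sin` removes even the remaining Python-call overhead:

```python
import math
import numpy as np

x = 0.5

# Both compute the same value; math.sin dispatches straight to the
# C library function, while np.sin goes through NumPy's ufunc
# machinery, which only pays off on whole arrays.
assert math.isclose(math.sin(x), float(np.sin(x)))
```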
Array allocations
e.g. `transformed_a = np.zeros_like(a)`.
This is just a Numpy function call and thus Cython has no ability to speed it up. If it's just an intermediate value that doesn't need to be returned to Python then you might consider a fixed-size C array on the stack:

```cython
cdef double transformed_a[10]  # note - you must know the size at compile-time
```

or allocating the memory with the C `malloc` function (remember to `free` it), or using Cython's `cython.view.array` (which is still a Python object, but can be a little quicker).
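As an aside in the same spirit (a pure-NumPy option, not a Cython feature), `np.empty_like` skips the zero-fill that `np.zeros_like` does - but it is only safe when every element is written before being read, which is not the case in the example above:

```python
import numpy as np

a = np.random.randn(100)

# np.empty_like allocates without zero-filling, so it's slightly
# cheaper than np.zeros_like - but the contents are garbage until
# assigned, so every element must be written before any read.
out = np.empty_like(a)
out[:] = a * 2.0   # here every element is written up front

assert out.shape == a.shape
```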
Whole-array arithmetic
e.g. `transformed_a * b`, which multiplies `transformed_a` and `b` element by element.
Cython doesn't help you here - it's just a disguised function call (although Pythran+Cython may have some benefits). For large arrays this kind of operation is pretty efficient in Numpy, so don't overthink it.
Note that whole-array operations aren't defined for Cython typed memoryviews, so you need to do `np.asarray(memview)` to get them back to Numpy arrays. This typically doesn't need a copy and is quick.
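The no-copy round-trip can be checked from pure Python, using the built-in `memoryview` as a stand-in for a Cython typed memoryview:

```python
import numpy as np

a = np.random.randn(10)
mv = memoryview(a)      # stand-in for a Cython typed memoryview

back = np.asarray(mv)
# No copy was made: both objects view the same underlying buffer,
# so writing through one is visible through the other.
assert np.shares_memory(back, a)
back[0] = 42.0
assert a[0] == 42.0
```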
For some operations like this, you can use BLAS and LAPACK functions (which are fast C implementations of array and matrix operations). Scipy provides a Cython interface for them (https://docs.scipy.org/doc/scipy/reference/linalg.cython_blas.html). They're a little more complex to use than the natural Python code.
The illustrative example
Just for completeness, I'd write it something like:
```cython
import numpy as np
from libc.math cimport sin
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def some_func(double[::1] a, b):
    cdef double[::1] transformed_a = np.zeros_like(a)
    cdef double last = 0
    cdef double an, delta
    cdef Py_ssize_t n
    for n in range(1, a.shape[0]):
        an = a[n]
        if an > 0:
            delta = an - a[n-1]
            transformed_a[n] = delta*last
        else:
            last = sin(an)
    return np.asarray(transformed_a) * b
```
which is a little over 10x faster.
`cython -a` is helpful here - it produces an annotated HTML file that shows which lines involve a lot of interaction with Python.