Improving FFT performance in Python

What is the fastest FFT implementation in Python?

It seems numpy.fft and scipy.fftpack both are based on fftpack, and not FFTW. Is fftpack as fast as FFTW? What about using multithreaded FFT, or using distributed (MPI) FFT?


You could certainly wrap whatever FFT implementation that you wanted to test using Cython or other like-minded tools that allow you to access external libraries.

GPU-based

If you're going to test FFT implementations, you might also take a look at GPU-based codes (if you have access to the proper hardware). There are several: reikna.fft, scikits.cuda.

CPU-based

There's also a CPU based python FFTW wrapper pyFFTW.

(There is pyFFTW3 as well, but it is not so actively maintained as pyFFTW, and it does not work with Python3. (source))

I don't have experience with any of these. It's probably going to fall to you to do some digging around and benchmark different codes for your particular application if speed is important to you.


For a test detailed at https://gist.github.com/fnielsen/99b981b9da34ae3d5035 I find that scipy.fftpack performs fine compared to my simple application of pyfftw via pyfftw.interfaces.scipy_fftpack, except for data with a length corresponding to a prime number.

There seems to be some setup cost associated with evoking pyfftw.interfaces.scipy_fftpack.fft the first time. The second time it is faster. Numpy's and scipy's fftpack with a prime number performs terribly for the size of data I tried. CZT is faster in that case. Some months ago an issue was put up at Scipy's Github about the problem, see https://github.com/scipy/scipy/issues/4288

20000 prime=False
  padded_fft : 0.003116
   numpy_fft : 0.003502
   scipy_fft : 0.001538
         czt : 0.035041
    fftw_fft : 0.004007
------------------------------------------------------------
20011 prime=True
  padded_fft : 0.001070
   numpy_fft : 1.263672
   scipy_fft : 0.875641
         czt : 0.033139
    fftw_fft : 0.009980
------------------------------------------------------------
21803 prime=True
  padded_fft : 0.001076
   numpy_fft : 1.510341
   scipy_fft : 1.043572
         czt : 0.035129
    fftw_fft : 0.011463
------------------------------------------------------------
21804 prime=False
  padded_fft : 0.001108
   numpy_fft : 0.004672
   scipy_fft : 0.001620
         czt : 0.033854
    fftw_fft : 0.005075
------------------------------------------------------------
21997 prime=True
  padded_fft : 0.000940
   numpy_fft : 1.534876
   scipy_fft : 1.058001
         czt : 0.034321
    fftw_fft : 0.012839
------------------------------------------------------------
32768 prime=False
  padded_fft : 0.001222
   numpy_fft : 0.002410
   scipy_fft : 0.000925
         czt : 0.039275
    fftw_fft : 0.005714
------------------------------------------------------------

The pyFFTW3 package is inferior compared to the pyFFTW library, at least implementation wise. Since they both wrap the FFTW3 library I guess speed should be the same.

https://pypi.python.org/pypi/pyFFTW


Where I work some researchers have compiled this Fortran library which setups and calls the FFTW for a particular problem. This Fortran library (module with some subroutines) expect some input data (2D lists) from my Python program.

What I did was to create a little C-extension for Python wrapping the Fortran library, where I basically calls "init" to setup a FFTW planner, and another function to feed my 2D lists (arrays), and a "compute" function.

Creating a C-extensions is a small task, and there a lot of good tutorials out there for that particular task.

To good thing about this approach is that we get speed .. a lot of speed. The only drawback is in the C-extension where we must iterate over the Python list, and extract all the Python data into a memory buffer.


The FFTW site shows fftpack running about 1/3 as fast as FFTW, but that's with a mechanically translated Fortran-to-C step followed by C compilation, and I don't know if numpy/scipy uses a more direct Fortran compilation. If performance is critical to you, you might consider compiling FFTW into a DLL/shared library and using ctypes to access it, or building a custom C extension.