Is there a performance difference between Numpy and Pandas?

I've written a bunch of code on the assumption that I was going to use Numpy arrays. Turns out the data I am getting is loaded through Pandas. I remember now that I loaded it in Pandas because I was having some problems loading it in Numpy. I believe the data was just too large.

Therefore I was wondering: is there a difference in computational performance between Numpy and Pandas?

If Pandas is more efficient, then I would rather rewrite all my code for Pandas, but if there is no gain in efficiency I'll just stick with numpy arrays...


There can be a significant performance difference, of an order of magnitude for multiplications and multiple orders of magnitude for indexing a few random values.

I was actually wondering about the same thing and came across this interesting comparison: http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/
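As a rough, self-contained illustration of that kind of gap (exact ratios will vary with array size and library versions; the names below are my own), one can time a multiplication and a fancy-indexing operation on both containers:

import numpy as np
import pandas as pd
from timeit import timeit

arr = np.random.rand(1_000_000)
ser = pd.Series(arr)
idx = np.random.randint(0, arr.size, 10)   # a few random positions

# element-wise multiplication
t_np = timeit(lambda: arr * 2, number=100)
t_pd = timeit(lambda: ser * 2, number=100)

# indexing a few random values
t_np_ix = timeit(lambda: arr[idx], number=10_000)
t_pd_ix = timeit(lambda: ser.iloc[idx], number=10_000)

print(f"multiply:    numpy {t_np:.4f}s  pandas {t_pd:.4f}s")
print(f"fancy index: numpy {t_np_ix:.4f}s  pandas {t_pd_ix:.4f}s")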


I think it's more about using the two strategically and shifting data between them (from numpy to pandas or vice versa) based on the performance you see. As a recent example, I was trying to concatenate 4 small pickle files with 10k rows each (data.shape -> (10000, 4)) using numpy.

The code was something like:

import glob
import joblib
import numpy as np

n_concat = np.empty((0, 4))
for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    n_concat = np.vstack((n_concat, n_data))  # copies all accumulated rows each iteration
joblib.dump(n_concat, 'data/save_file.pkl', compress=True)

This crashed my laptop (8 GB RAM, i5), which was surprising since the volume wasn't really that huge. The 4 compressed pickled files were roughly 5 MB each.

The same thing worked great in pandas:

import glob
import joblib
import pandas as pd

for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    try:
        df = pd.concat([df, pd.DataFrame(n_data, columns=[...])])
    except NameError:  # df does not exist yet on the first iteration
        df = pd.DataFrame(n_data, columns=[...])
joblib.dump(df, 'data/save_file.pkl', compress=True)
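In both versions, by the way, the result grows inside the loop, so all accumulated rows are copied on every iteration. A sketch of a pattern that should be gentler on memory (assuming the same file layout as above) is to collect the chunks in a list and concatenate once at the end:

import glob
import joblib
import numpy as np
import pandas as pd

chunks = [joblib.load(p) for p in glob.glob('data/0*', recursive=False)]

n_concat = np.concatenate(chunks, axis=0)           # numpy: one allocation
df = pd.concat([pd.DataFrame(c) for c in chunks])   # pandas: one concat
joblib.dump(df, 'data/save_file.pkl', compress=True)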

On the other hand, when I was implementing gradient descent by iterating over a pandas DataFrame, it was horribly slow, while using numpy for the job was much quicker.

In general, I've seen that pandas usually works better for moving around/munging moderately large chunks of data and for common column operations, while numpy works best for vectorized and recursive (perhaps more math-intensive) work over smaller sets of data.

Moving data between the two is hassle-free, so using both strategically is the way to go.
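For reference, the round trip is a one-liner in each direction (a minimal sketch; depending on dtypes, to_numpy() may return a view or a copy):

import numpy as np
import pandas as pd

arr = np.arange(12, dtype=np.float64).reshape(4, 3)

df = pd.DataFrame(arr, columns=['a', 'b', 'c'])   # numpy -> pandas
back = df.to_numpy()                              # pandas -> numpy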


In my experiments on large numeric data, Pandas is consistently 20 times slower than Numpy. This is a huge difference, given that only simple arithmetic operations were performed: slicing a column, mean(), searchsorted() - see below. Initially, I assumed that since Pandas is built on top of numpy, with its core implemented in optimized C just like numpy's, it would perform comparably; the huge performance gap shows that the overhead Pandas adds per operation is anything but negligible.

In the examples below, data is a pandas frame with 8M rows and 3 columns (int32, float32, float32), without NaN values; column #0 (time) is sorted. data_np was created as data.values.astype('float32'). Results are on Python 3.8, Ubuntu:

A. Column slices and mean():

# Pandas 
%%timeit 
x = data.x 
for k in range(100): x[100000:100001+k*100].mean() 

15.8 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Numpy
%%timeit 
for k in range(100): data_np[100000:100001+k*100,1].mean() 

874 µs ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Pandas is 18 times slower than Numpy (15.8 ms vs 0.874 ms).

B. Search in a sorted column:

# Pandas
%timeit data.time.searchsorted(1492474643)
20.4 µs ± 920 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Numpy
%timeit data_np[:, 0].searchsorted(1492474643)
1.03 µs ± 3.55 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Pandas is 20 times slower than Numpy (20.4 µs vs 1.03 µs).
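One practical consequence: if a hot path only needs the raw values, extracting the column once with to_numpy() should get you essentially the numpy timings above while the rest of the pipeline stays in Pandas. A sketch on synthetic stand-in data (not measured on the frame above):

import numpy as np
import pandas as pd

# synthetic stand-in for the 8M-row frame described above
data = pd.DataFrame({
    'time': np.arange(8_000_000, dtype=np.int32),
    'x': np.random.rand(8_000_000).astype('float32'),
})

x = data.x.to_numpy()            # extract once...
x[100000:100100].mean()          # ...then slice/mean at numpy speed
data.time.to_numpy().searchsorted(1492474643)   # binary search at numpy speed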

EDIT: I implemented a namedarray class that bridges the gap between Pandas and Numpy: it is based on Numpy's ndarray class and hence performs better than Pandas (typically ~7x faster), and it is fully compatible with Numpy's API and all its operators, while keeping column names similar to Pandas' DataFrame, so that manipulating individual columns is easier. This is a prototype implementation. Unlike Pandas, namedarray does not allow different data types across columns. The code can be found here: https://github.com/mwojnars/nifty/blob/master/math.py (search for "namedarray").