When to use pandas series, numpy ndarrays or simply python dictionaries?

Solution 1:

The rule of thumb that I usually apply: use the simplest data structure that still satisfies your needs. If we rank the data structures from simplest to most complex, it usually ends up like this:

  1. Dictionaries / lists
  2. Numpy arrays
  3. Pandas series / dataframes

So first consider dictionaries / lists. If these allow you to do all data operations that you need, then all is fine. If not, start considering numpy arrays. Some typical reasons for moving to numpy arrays are:

  • Your data is 2-dimensional (or higher). Although nested dictionaries/lists can be used to represent multi-dimensional data, in most situations numpy arrays will be more efficient.
  • You have to perform a lot of numerical calculations. As already pointed out by zhqiat, numpy will give a significant speed-up in this case. Furthermore, numpy arrays come bundled with a large number of mathematical functions.
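
To make the two reasons above concrete, here is a minimal sketch. The data and variable names are made up for illustration; it only shows how a column-wise calculation looks with nested lists versus a numpy array.

```python
import numpy as np

# Hypothetical 2-D data: each inner list is one row of measurements.
nested = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# With plain lists, a column mean needs an explicit comprehension.
col_means_list = [sum(col) / len(col) for col in zip(*nested)]

# With a numpy array, the same operation is one vectorized call.
arr = np.array(nested)
col_means_np = arr.mean(axis=0)

# Element-wise math applies to the whole array at once, no loops needed.
scaled = np.sqrt(arr) * 2
```

For two rows the difference is negligible; the speed-up only becomes noticeable as the data grows.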

Then there are also some typical reasons for going beyond numpy arrays and to the more-complex but also more-powerful pandas series/dataframes:

  • You have to merge multiple data sets with each other, or do reshaping/reordering of your data. This diagram gives a nice overview of all the 'data wrangling' operations that pandas allows you to do.
  • You have to import data from or export data to a specific file format like Excel, HDF5 or SQL. Pandas comes with convenient import/export functions for this.
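
As a rough sketch of the two points above, the snippet below merges two small data sets, reshapes the result, and round-trips it through SQL. The tables, column names and the `results` table name are invented for the example, and an in-memory SQLite database stands in for a real one.

```python
import sqlite3

import pandas as pd

# Hypothetical data sets that share an "id" column.
people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
scores = pd.DataFrame({"id": [1, 2, 4], "score": [90, 75, 60]})

# Merge: an inner join keeps only the ids present in both data sets.
merged = people.merge(scores, on="id", how="inner")

# Reshape: go from one column per field to one row per (id, field) pair.
long_form = merged.melt(id_vars="id", var_name="field", value_name="value")

# Export to SQL and read it back, using an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
try:
    merged.to_sql("results", conn, index=False)
    round_trip = pd.read_sql("SELECT * FROM results", conn)
finally:
    conn.close()
```

Doing the same join and reshape with dictionaries or bare numpy arrays is possible, but you would have to write the matching logic yourself.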

Solution 2:

If you want an answer that tells you to stick with just one type of data structure, here is one: use pandas series/dataframe structures.

The pandas series object can be seen as an enhanced numpy 1D array, and the pandas dataframe can be seen as an enhanced numpy 2D array. The main difference is that pandas series and dataframes have an explicit index, while numpy arrays have implicit (positional) indexing. So, in any python code where you would use something like

import numpy as np
a = np.array([1,2,3])

you can just use

import pandas as pd
a = pd.Series([1,2,3])

Most numpy functions, and many of the methods of numpy arrays, will also work with pandas series. By analogy, the same can be done with dataframes and numpy 2D arrays.
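
Here is a small sketch of what the explicit index means in practice. The labels and values are made up. It shows a numpy function applied to a series, and one place where a series behaves differently from an array: arithmetic between two series matches values by label, not by position.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
s = pd.Series([1, 2, 3], index=["x", "y", "z"])

# A numpy ufunc accepts a series and returns a series with the same index.
roots = np.sqrt(s)

# A series can be accessed by label; an array only by position.
first_by_label = s["x"]
first_by_position = a[0]

# Arithmetic between two series aligns on labels, not on positions.
t = pd.Series([10, 20, 30], index=["z", "y", "x"])
total = s + t  # x: 1 + 30, y: 2 + 20, z: 3 + 10
```

So a series is not always a drop-in replacement: code that relies on purely positional arithmetic can give different results once the indexes differ.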

A further question you might have is about the performance difference between a numpy array and a pandas series. Here is a post that compares the performance of the two: performance of pandas series vs numpy arrays.

Please note that although a pandas series is somewhat slower than a numpy array, you can work around this by accessing the values attribute of the series:

a.values

The result of accessing the values attribute of a pandas series is a numpy array!
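
A minimal sketch of that conversion, with made-up data. Note that `values` is accessed without parentheses. The `to_numpy()` method does the same job and is the spelling the pandas documentation recommends.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# `values` is an attribute, so there are no parentheses.
arr = s.values

# `to_numpy()` is the equivalent method.
arr_alt = s.to_numpy()

# From here on you are working with a plain numpy array.
total = np.sum(arr * 2)
```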