What does axis in pandas mean?

Solution 1:

It specifies the axis along which the means are computed. By default axis=0. This is consistent with the numpy.mean usage when axis is specified explicitly (in numpy.mean, axis==None by default, which computes the mean value over the flattened array) , in which axis=0 along the rows (namely, index in pandas), and axis=1 along the columns. For added clarity, one may choose to specify axis='index' (instead of axis=0) or axis='columns' (instead of axis=1).

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓

Solution 2:

These answers do help explain this, but it still isn't perfectly intuitive for a non-programmer (i.e. someone like me who is learning Python for the first time in context of data science coursework). I still find using the terms "along" or "for each" wrt to rows and columns to be confusing.

What makes more sense to me is to say it this way:

  • Axis 0 will act on all the ROWS in each COLUMN
  • Axis 1 will act on all the COLUMNS in each ROW

So a mean on axis 0 will be the mean of all the rows in each column, and a mean on axis 1 will be a mean of all the columns in each row.

Ultimately this is saying the same thing as @zhangxaochen and @Michael, but in a way that is easier for me to internalize.

Solution 3:

Let's visualize (you gonna remember always), enter image description here

In Pandas:

  1. axis=0 means along "indexes". It's a row-wise operation.

Suppose, to perform concat() operation on dataframe1 & dataframe2, we will take dataframe1 & take out 1st row from dataframe1 and place into the new DF, then we take out another row from dataframe1 and put into new DF, we repeat this process until we reach to the bottom of dataframe1. Then, we do the same process for dataframe2.

Basically, stacking dataframe2 on top of dataframe1 or vice a versa.

E.g making a pile of books on a table or floor

  1. axis=1 means along "columns". It's a column-wise operation.

Suppose, to perform concat() operation on dataframe1 & dataframe2, we will take out the 1st complete column(a.k.a 1st series) of dataframe1 and place into new DF, then we take out the second column of dataframe1 and keep adjacent to it (sideways), we have to repeat this operation until all columns are finished. Then, we repeat the same process on dataframe2. Basically, stacking dataframe2 sideways.

E.g arranging books on a bookshelf.

More to it, since arrays are better representations to represent a nested n-dimensional structure compared to matrices! so below can help you more to visualize how axis plays an important role when you generalize to more than one dimension. Also, you can actually print/write/draw/visualize any n-dim array but, writing or visualizing the same in a matrix representation(3-dim) is impossible on a paper more than 3-dimensions.

enter image description here

Solution 4:

axis refers to the dimension of the array, in the case of pd.DataFrames axis=0 is the dimension that points downwards and axis=1 the one that points to the right.

Example: Think of an ndarray with shape (3,5,7).

a = np.ones((3,5,7))

a is a 3 dimensional ndarray, i.e. it has 3 axes ("axes" is plural of "axis"). The configuration of a will look like 3 slices of bread where each slice is of dimension 5-by-7. a[0,:,:] will refer to the 0-th slice, a[1,:,:] will refer to the 1-st slice etc.

a.sum(axis=0) will apply sum() along the 0-th axis of a. You will add all the slices and end up with one slice of shape (5,7).

a.sum(axis=0) is equivalent to

b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()

b and a.sum(axis=0) will both look like this

array([[ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.]])

In a pd.DataFrame, axes work the same way as in numpy.arrays: axis=0 will apply sum() or any other reduction function for each column.

N.B. In @zhangxaochen's answer, I find the phrases "along the rows" and "along the columns" slightly confusing. axis=0 should refer to "along each column", and axis=1 "along each row".