How do I get the row count of a Pandas DataFrame?
For a dataframe df
, one can use any of the following:
len(df.index)
df.shape[0]
-
df[df.columns[0]].count()
(== number of non-NaN values in first column)
Code to reproduce the plot:
import numpy as np
import pandas as pd
import perfplot
perfplot.save(
"out.png",
setup=lambda n: pd.DataFrame(np.arange(n * 3).reshape(n, 3)),
n_range=[2**k for k in range(25)],
kernels=[
lambda df: len(df.index),
lambda df: df.shape[0],
lambda df: df[df.columns[0]].count(),
],
labels=["len(df.index)", "df.shape[0]", "df[df.columns[0]].count()"],
xlabel="Number of rows",
)
Suppose df
is your dataframe then:
count_row = df.shape[0] # Gives number of rows
count_col = df.shape[1] # Gives number of columns
Or, more succinctly,
r, c = df.shape
Use len(df)
:-).
__len__()
is documented with "Returns length of index".
Timing info, set up the same way as in root's answer:
In [7]: timeit len(df.index)
1000000 loops, best of 3: 248 ns per loop
In [8]: timeit len(df)
1000000 loops, best of 3: 573 ns per loop
Due to one additional function call, it is of course correct to say that it is a bit slower than calling len(df.index)
directly. But this should not matter in most cases. I find len(df)
to be quite readable.
How do I get the row count of a Pandas DataFrame?
This table summarises the different situations in which you'd want to count something in a DataFrame (or Series, for completeness), along with the recommended method(s).
Footnotes
DataFrame.count
returns counts for each column as aSeries
since the non-null count varies by column.DataFrameGroupBy.size
returns aSeries
, since all columns in the same group share the same row-count.DataFrameGroupBy.count
returns aDataFrame
, since the non-null count could differ across columns in the same group. To get the group-wise non-null count for a specific column, usedf.groupby(...)['x'].count()
where "x" is the column to count.
#Minimal Code Examples
Below, I show examples of each of the methods described in the table above. First, the setup -
df = pd.DataFrame({
'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()
df
A B
0 a x
1 a x
2 b NaN
3 b x
4 c NaN
s
0 x
1 x
2 NaN
3 x
4 NaN
Name: B, dtype: object
Row Count of a DataFrame: len(df)
, df.shape[0]
, or len(df.index)
len(df)
# 5
df.shape[0]
# 5
len(df.index)
# 5
It seems silly to compare the performance of constant time operations, especially when the difference is on the level of "seriously, don't worry about it". But this seems to be a trend with other answers, so I'm doing the same for completeness.
Of the three methods above, len(df.index)
(as mentioned in other answers) is the fastest.
Note
- All the methods above are constant time operations as they are simple attribute lookups.
df.shape
(similar tondarray.shape
) is an attribute that returns a tuple of(# Rows, # Cols)
. For example,df.shape
returns(8, 2)
for the example here.
Column Count of a DataFrame: df.shape[1]
, len(df.columns)
df.shape[1]
# 2
len(df.columns)
# 2
Analogous to len(df.index)
, len(df.columns)
is the faster of the two methods (but takes more characters to type).
Row Count of a Series: len(s)
, s.size
, len(s.index)
len(s)
# 5
s.size
# 5
len(s.index)
# 5
s.size
and len(s.index)
are about the same in terms of speed. But I recommend len(df)
.
Note
size
is an attribute, and it returns the number of elements (=count of rows for any Series). DataFrames also define a size attribute which returns the same result asdf.shape[0] * df.shape[1]
.
Non-Null Row Count: DataFrame.count
and Series.count
The methods described here only count non-null values (meaning NaNs are ignored).
Calling DataFrame.count
will return non-NaN counts for each column:
df.count()
A 5
B 3
dtype: int64
For Series, use Series.count
to similar effect:
s.count()
# 3
Group-wise Row Count: GroupBy.size
For DataFrames
, use DataFrameGroupBy.size
to count the number of rows per group.
df.groupby('A').size()
A
a 2
b 2
c 1
dtype: int64
Similarly, for Series
, you'll use SeriesGroupBy.size
.
s.groupby(df.A).size()
A
a 2
b 2
c 1
Name: B, dtype: int64
In both cases, a Series
is returned. This makes sense for DataFrames
as well since all groups share the same row-count.
Group-wise Non-Null Row Count: GroupBy.count
Similar to above, but use GroupBy.count
, not GroupBy.size
. Note that size
always returns a Series
, while count
returns a Series
if called on a specific column, or else a DataFrame
.
The following methods return the same thing:
df.groupby('A')['B'].size()
df.groupby('A').size()
A
a 2
b 2
c 1
Name: B, dtype: int64
Meanwhile, for count
, we have
df.groupby('A').count()
B
A
a 2
b 1
c 0
...called on the entire GroupBy object, vs.,
df.groupby('A')['B'].count()
A
a 2
b 1
c 0
Name: B, dtype: int64
Called on a specific column.
TL;DR use len(df)
len()
lets you for getting the number of items in a list. So, for getting row counts of a DataFrame, simply use len(df)
.
Alternatively, you can access all rows and all columns with df.index
, and df.columns
,respectively. Since you can use the len(anyList)
for getting the element numbers, use
len(df.index)
will give you the number of rows, and len(df.columns)
will give the number of columns.
Or, you can use df.shape
which returns the number of rows and columns together (as a tuple). If you want to access the number of rows, only use df.shape[0]
. For the number of columns, only use: df.shape[1]
.