Normalizing a pandas DataFrame by row
Solution 1:
To work around the broadcasting issue, use the div method with axis=0, which aligns the row sums with the row index instead of the column labels:
df.div(df.sum(axis=1), axis=0)
See pandas User Guide: Matching / broadcasting behavior
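As a small illustration of the alignment problem (the frame and its labels here are made up for the example, not taken from the question), plain division matches the sums against the column labels, while div with axis=0 matches them against the rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with distinct row and column labels.
df = pd.DataFrame([[1.0, 3.0], [2.0, 2.0]], index=["r1", "r2"], columns=["a", "b"])
row_sums = df.sum(axis=1)  # Series indexed by the row labels r1, r2

# Plain division aligns the Series index with the column labels,
# so nothing matches and every entry becomes NaN.
naive = df / row_sums

# div with axis=0 aligns the Series with the row index instead.
normalized = df.div(row_sums, axis=0)

print(naive)
print(normalized)
```

With the default integer labels on both axes the plain division may not produce NaN everywhere; it can silently divide each column by the wrong row's sum, which is harder to notice.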
Solution 2:
Alternatively, get the underlying NumPy array, sum along axis 1 while keeping the dimensions, and divide element-wise:
df / df.to_numpy().sum(axis=1, keepdims=True)
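A minimal sketch of why keepdims=True matters, using a made-up 3x2 frame: the sums keep shape (3, 1), so they line up row by row when broadcast across the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not from the question.
df = pd.DataFrame([[1.0, 3.0], [2.0, 2.0], [4.0, 1.0]], columns=["a", "b"])
arr = df.to_numpy()

col_vector = arr.sum(axis=1, keepdims=True)  # shape (3, 1): one sum per row
flat = arr.sum(axis=1)                       # shape (3,): row axis collapsed

# The (3, 1) array is broadcast across the columns of the (3, 2) frame.
result = df / col_vector

# Same values as the div-based approach from Solution 1.
reference = df.div(df.sum(axis=1), axis=0)
print(col_vector.shape, flat.shape)
print(result)
```

One difference to keep in mind: df.sum skips NaN values by default, while the NumPy sum propagates them, so the two approaches can disagree on frames with missing data.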
In the timings below, this method takes about 40% less time (roughly 1.65x as fast) than summing along the axis and then calling div aligned on the index:
df = pd.DataFrame(np.random.rand(1000000, 100))
%timeit -n 10 df.div(df.sum(axis=1), axis=0)
748 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 df / df.to_numpy().sum(axis=1, keepdims=True)
452 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
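Outside IPython, where %timeit is not available, a comparable measurement can be sketched with the standard timeit module. The helper name best_time and the smaller frame size are assumptions made for this example, so the absolute numbers will not match the ones above:

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1,000,000 x 100 frame above so the sketch runs quickly.
df = pd.DataFrame(np.random.rand(10_000, 100))


def best_time(func, repeat=3, number=5):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


t_div = best_time(lambda: df.div(df.sum(axis=1), axis=0))
t_numpy = best_time(lambda: df / df.to_numpy().sum(axis=1, keepdims=True))

print(f"div: {t_div * 1e3:.2f} ms, numpy: {t_numpy * 1e3:.2f} ms")
```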
In fact, this trend holds if we increase the number of rows and the number of columns:
Code to reproduce the above plots:
import perfplot
import pandas as pd
import numpy as np
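def enke(df):
    # Divide by the row sums computed on the underlying NumPy array
    return df / df.to_numpy().sum(axis=1, keepdims=True)

def joris(df):
    # Divide by the row sums using index-aligned div
    return df.div(df.sum(axis=1), axis=0)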
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(n, 10)),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=np.allclose,
    xlabel='~len(df)',
    title='For len(df)x10 DataFrames'
)
perfplot.show(
    # 1.4 ** k is a float; np.random.rand needs integer dimensions
    setup=lambda n: pd.DataFrame(np.random.rand(10000, int(n))),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[1.4 ** k for k in range(21)],
    equality_check=np.allclose,
    xlabel='~width(df)',
    title='For 10_000xwidth(df) DataFrames'
)