Normalizing a pandas DataFrame by row

Solution 1:

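Dividing with / aligns the row sums against the column labels rather than the rows. To overcome this broadcasting issue, use the div method with axis=0, which matches on the row index: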

df.div(df.sum(axis=1), axis=0)

See pandas User Guide: Matching / broadcasting behavior
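
As a small illustration of the broadcasting issue, here is a sketch on made-up data (the DataFrame and its labels are hypothetical, chosen only for this example). It compares plain division with div along axis 0:

```python
import pandas as pd

# Hypothetical toy data, chosen only for illustration.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]}, index=["x", "y"])

row_sums = df.sum(axis=1)  # Series indexed by the row labels "x" and "y"

# Plain division aligns the index of row_sums with the columns of df.
# No labels match, so every value in the result is NaN.
naive = df / row_sums

# div with axis=0 aligns row_sums with the row index instead.
normalized = df.div(row_sums, axis=0)
print(normalized)
```

Each row of normalized should sum to 1, while naive contains only NaN values.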

Solution 2:

Alternatively, get the underlying NumPy array, sum along axis 1 while keeping the dimensions (keepdims=True), and divide element-wise:

df / df.to_numpy().sum(axis=1, keepdims=True)
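
To make the role of keepdims=True concrete, here is a minimal sketch on made-up data, assuming all columns are numeric. The row sums keep shape (n, 1), so they broadcast across the columns without any index alignment:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only for illustration.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]})

# keepdims=True keeps the summed axis with length 1: shape (2, 1), not (2,).
row_sums = df.to_numpy().sum(axis=1, keepdims=True)

# The (2, 1) array broadcasts across the two columns of the (2, 2) frame.
normalized = df / row_sums

# Compare against the div-based approach.
same = np.allclose(normalized, df.div(df.sum(axis=1), axis=0))
print(row_sums.shape, same)
```

Without keepdims=True the sums would be a 1-D array, which pandas would try to match against the columns rather than the rows.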

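In the timings below, on a 1,000,000 x 100 DataFrame, this method takes about 40% less time (roughly 1.65x as fast) than summing along the axis and dividing with div: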

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000000, 100))

%timeit -n 10 df.div(df.sum(axis=1), axis=0)
748 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df / df.to_numpy().sum(axis=1, keepdims=True)
452 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In fact, this trend holds if we increase the number of rows and the number of columns:

[Plots: runtime of the two methods as the number of rows and the number of columns grow]

Code to reproduce the above plots:

import perfplot
import pandas as pd
import numpy as np

def enke(df):
    return df / df.to_numpy().sum(axis=1, keepdims=True)

def joris(df):
    return df.div(df.sum(axis=1), axis=0)

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(n, 10)), 
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=np.allclose,  
    xlabel='~len(df)',
    title='For len(df)x10 DataFrames'
)

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(10000, n)), 
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=sorted({int(1.4 ** k) for k in range(21)}),  # column counts must be integers
    equality_check=np.allclose,  
    xlabel='~width(df)',
    title='For 10_000xwidth(df) DataFrames'
)