How to do Gaussian fit and FWHM measurement by group?

I have a pandas.DataFrame of the following form (a simple example; in reality it consists of hundreds of millions of rows). I want to compute a Gaussian fit and the FWHM of diff after grouping by ID.

df = 
X ID diff

1 52 11.0
12 85 -102.0
17 42 43.0
18 2 81.0
59 122 -10.0
78 21 -43.0
96 144 -6.0
101 76 -56.0
113 119 -75.0
120 82 4.0
134 83 11.0
139 39 16.0
152 12 -61.0
169 139 -124.0
170 37 26.0
173 35 -190.0
185 103 -64.0
192 122 -72.0
193 108 51.0
195 88 -30.0
199 43 -100.0
209 89 -154.0
243 32 94.0
246 138 -25.0
250 50 2.0
258 53 167.0
261 42 -23.0
272 69 -64.0
276 95 -14.0
279 25 -115.0
286 79 -65.0
288 82 2.0
332 43 213.0
  1. What do I do after df[['ID','diff']].groupby(['ID'])?
  2. In fact, 'ID' ranges from 1 to 144.
  3. A graph image is not essential.
  4. I only need the result values.
  5. The example above is a subset of the whole dataset.

Solution 1:

Trial dataset

Let's create a dummy dataset to ground the discussion:

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(123)

def generate_data(identifier, size=100, loc=0., scale=1.):
    return [
        {"id": identifier, "value": value}
        for value in scale*np.random.randn(size) + loc
    ]

df = pd.DataFrame(
    generate_data(1) + generate_data(2, loc=1, scale=2) + generate_data(3, loc=-1, scale=3)
)

A random sample looks like:

     id     value
263   3 -2.750610
135   2  1.646938
285   3 -3.047614
258   3 -0.911071
154   2 -1.039310

Group By and Apply

The key to solving your problem is the apply method exposed by the DataFrameGroupBy object returned by groupby.

First create a function with the following interface:

  • Takes a DataFrame as input;
  • Implements the desired logic;
  • Returns a Series as output.

From your problem statement, it boils down to:

def analyze(frame):
    # MLE fit of a normal distribution: returns (loc, scale)
    loc, scale = stats.norm.fit(frame["value"])
    # FWHM of a Gaussian: 2*sqrt(2*ln(2))*sigma (about 2.355*sigma)
    fwhm = 2*np.sqrt(2*np.log(2))*scale
    return pd.Series({
         "loc": loc, "scale": scale,
         "count": frame.shape[0], "fwhm": fwhm
    })

Then simply apply it to the grouped data:

df.groupby("id").apply(analyze)

The DataFrameGroupBy object calls this function on every grouped DataFrame (bucket) and assembles a new DataFrame with the Series fields as columns and the group keys as the index.
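To make the mechanics concrete, here is the same pattern written as an explicit loop over the buckets. This is a minimal sketch with toy data and a simplified analyze function, for illustration only:

```python
import pandas as pd

# Tiny stand-in data to illustrate the mechanics of groupby/apply:
df = pd.DataFrame({"id": [1, 1, 2, 2], "value": [0.0, 2.0, 1.0, 3.0]})

def analyze(frame):
    # Simplified: just the per-group mean and row count
    return pd.Series({"loc": frame["value"].mean(),
                      "count": frame.shape[0]})

# What apply does, written as an explicit loop over the buckets:
pieces = {key: analyze(frame) for key, frame in df.groupby("id")}
manual = pd.DataFrame(pieces).T

# Same result in one call:
auto = df.groupby("id").apply(analyze)
```

Both manual and auto hold the same values; apply simply handles the looping and the stacking of the returned Series for you.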

It returns:

         loc     scale  count      fwhm
id                                     
1   0.027109  1.128240  100.0  2.656803
2   0.960929  1.940107  100.0  4.568603
3  -1.285394  2.908368  100.0  6.848684
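Since the question mentions hundreds of millions of rows, a note on performance (an optional sketch, not part of the answer above): an unconstrained stats.norm.fit returns just the sample mean and the biased (ddof=0) standard deviation, so the same table can be produced with vectorized aggregations instead of a Python-level apply:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the full dataset:
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": rng.integers(1, 4, 1000),
                   "value": rng.normal(size=1000)})

# stats.norm.fit with both parameters free reduces to the sample mean
# and the biased (ddof=0) standard deviation, so the whole table can
# be built with fast vectorized groupby aggregations:
g = df.groupby("id")["value"]
out = g.agg(loc="mean", count="size")
out["scale"] = g.std(ddof=0)                      # MLE sigma
out["fwhm"] = 2 * np.sqrt(2 * np.log(2)) * out["scale"]
```

The result matches the apply-based version up to floating-point precision, but avoids calling a Python function once per group.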