How to do Gaussian fit and FWHM measurement by group?
I have a pandas.DataFrame of the form shown below. This is a small example; the real data has hundreds of millions of rows. After grouping by ID, I want to fit a Gaussian to diff and measure the FWHM for each group.
df =
X ID diff
1 52 11.0
12 85 -102.0
17 42 43.0
18 2 81.0
59 122 -10.0
78 21 -43.0
96 144 -6.0
101 76 -56.0
113 119 -75.0
120 82 4.0
134 83 11.0
139 39 16.0
152 12 -61.0
169 139 -124.0
170 37 26.0
173 35 -190.0
185 103 -64.0
192 122 -72.0
193 108 51.0
195 88 -30.0
199 43 -100.0
209 89 -154.0
243 32 94.0
246 138 -25.0
250 50 2.0
258 53 167.0
261 42 -23.0
272 69 -64.0
276 95 -14.0
279 25 -115.0
286 79 -65.0
288 82 2.0
332 43 213.0
- What do I do after "df[['ID','diff']].groupby(['ID'])" ?
- In fact, 'ID' ranges from 1 to 144.
- A graph image is not essential.
- I only need the result value.
- The example above is only a small subset of the full data.
Solution 1:
Trial dataset
Let's create a dummy dataset to support the discussion:
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed(123)
def generate_data(identifier, size=100, loc=0., scale=1.):
    return [
        {"id": identifier, "value": value}
        for value in scale*np.random.randn(size) + loc
    ]
df = pd.DataFrame(
    generate_data(1) + generate_data(2, loc=1, scale=2) + generate_data(3, loc=-1, scale=3)
)
A random sample looks like:
id value
263 3 -2.750610
135 2 1.646938
285 3 -3.047614
258 3 -0.911071
154 2 -1.039310
Group By and Apply
The key to solving your problem is the apply method exposed by the DataFrameGroupBy object that groupby returns.
First create a function with the following interface:
- Take a DataFrame as input;
- Implement the desired logic;
- Return a Series as output.
From your problem statement, it boils down to:
def analyze(frame):
    params = stats.norm.fit(frame["value"])
    fwhm = 2*np.sqrt(2*np.log(2))*params[1]
    return pd.Series({
        "loc": params[0], "scale": params[1],
        "count": frame.shape[0], "fwhm": fwhm
    })
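The factor 2*np.sqrt(2*np.log(2)) used in analyze is the ratio between the FWHM and the standard deviation of a Gaussian. A minimal sketch that checks this numerically, assuming an arbitrary sigma chosen only for illustration: the density evaluated half a FWHM away from the centre should be half of the peak value.

```python
import numpy as np
from scipy import stats

sigma = 1.5  # arbitrary scale, chosen only for this illustration
factor = 2 * np.sqrt(2 * np.log(2))  # FWHM / sigma for a Gaussian, about 2.3548

# The density is symmetric about loc, so the half-maximum points
# sit at loc - fwhm/2 and loc + fwhm/2.
half_width = factor * sigma / 2
peak = stats.norm.pdf(0.0, loc=0.0, scale=sigma)
at_half_width = stats.norm.pdf(half_width, loc=0.0, scale=sigma)

ratio = at_half_width / peak
print(factor)
print(ratio)  # should be 0.5 up to floating-point rounding
```

The check does not depend on sigma, since the FWHM scales linearly with it.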
Then simply apply it to the grouped data:
df.groupby("id").apply(analyze)
The DataFrameGroupBy object calls this function once per group (bucket), passing it that group's DataFrame, and returns a new DataFrame whose columns are the Series fields and whose index holds the group keys.
It returns:
loc scale count fwhm
id
1 0.027109 1.128240 100.0 2.656803
2 0.960929 1.940107 100.0 4.568603
3 -1.285394 2.908368 100.0 6.848684
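With hundreds of millions of rows, calling a Python function per group may become slow. For a plain Gaussian fit with no fixed parameters, stats.norm.fit returns the sample mean and the standard deviation computed with ddof=0, so the same numbers can be obtained from built-in group aggregations. The sketch below uses the question's column names (ID, diff) on made-up data; the frame contents and the name result are illustrative assumptions, not the asker's real data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up frame with the question's column names (values are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ID": rng.integers(1, 4, size=300),             # IDs 1, 2 and 3
    "diff": rng.normal(loc=0.0, scale=50.0, size=300),
})

grouped = df.groupby("ID")["diff"]
result = pd.DataFrame({
    "loc": grouped.mean(),
    "scale": grouped.std(ddof=0),  # ddof=0 matches the maximum-likelihood estimate
    "count": grouped.count(),
})
result["fwhm"] = 2 * np.sqrt(2 * np.log(2)) * result["scale"]

# Cross-check every group against scipy's fit.
for key, values in grouped:
    loc, scale = stats.norm.fit(values)
    assert np.isclose(loc, result.loc[key, "loc"])
    assert np.isclose(scale, result.loc[key, "scale"])

print(result)
```

If the fit needs anything beyond the plain two-parameter Gaussian (for example a fixed loc), the shortcut no longer applies and the apply-based version above is the safer choice.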