binning a dataframe in pandas in Python [duplicate]
Given the following dataframe in pandas:
import numpy as np
df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})
where id
is an id for each point consisting of an a
and b
value, how can I bin a
and b
into a specified set of bins (so that I can then take the median/average value of a
and b
in each bin)? df
might have NaN
values for a
or b
(or both) for any given row in df
.
Here's a better example using Joe Kington's solution with a more realistic df
. The thing I'm unsure about is how to access the df.b
elements for each df.a
group below:
a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a,bins))
# Get the mean of a in each group
print groups.mean()
## But how to get the mean of b for each group of a?
# ...
There may be a more efficient way (I have a feeling pandas.crosstab
would be useful here), but here's how I'd do it:
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100),
"b": np.random.random(100),
"id": np.arange(100)})
# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))
# Get the mean of each bin:
print groups.mean() # Also could do "groups.aggregate(np.mean)"
# Similarly, the median:
print groups.median()
# Apply some arbitrary function to aggregate binned data
print groups.aggregate(lambda x: np.mean(x[x > 0.5]))
Edit: As the OP was asking specifically for just the means of b
binned by the values in a
, just do
groups.mean().b
Also if you wanted the index to look nicer (e.g. display intervals as the index), as they do in @bdiamante's example, use pandas.cut
instead of numpy.digitize
. (Kudos to bidamante. I didn't realize pandas.cut
existed.)
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100),
"b": np.random.random(100) + 10})
# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))
# Get the mean of b, binned by the values in a
print groups.mean().b
This results in:
a
(0.00186, 0.111] 10.421839
(0.111, 0.22] 10.427540
(0.22, 0.33] 10.538932
(0.33, 0.439] 10.445085
(0.439, 0.548] 10.313612
(0.548, 0.658] 10.319387
(0.658, 0.767] 10.367444
(0.767, 0.876] 10.469655
(0.876, 0.986] 10.571008
Name: b
Not 100% sure if this is what you're looking for, but here's what I think you're getting at:
In [144]: df = DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})
In [145]: bins = [0, .25, .5, .75, 1]
In [146]: a_bins = df.a.groupby(cut(df.a,bins))
In [147]: b_bins = df.b.groupby(cut(df.b,bins))
In [148]: a_bins.agg([mean,median])
Out[148]:
mean median
a
(0, 0.25] 0.124173 0.114613
(0.25, 0.5] 0.367703 0.358866
(0.5, 0.75] 0.624251 0.626730
(0.75, 1] 0.875395 0.869843
In [149]: b_bins.agg([mean,median])
Out[149]:
mean median
b
(0, 0.25] 0.147936 0.166900
(0.25, 0.5] 0.394918 0.386729
(0.5, 0.75] 0.636111 0.655247
(0.75, 1] 0.851227 0.838805
Of course, I don't know what bins you had in mind, so you'll have to swap mine out for your circumstance.
Joe Kington's answer was very helpful, however, I noticed that it does not bin all of the data. It actually leaves out the row with a = a.min(). Summing up groups.size()
gave 99 instead of 100.
To guarantee that all data is binned, just pass in the number of bins to cut() and that function will automatically pad the first[last] bin by 0.1% to ensure all data is included.
df = pandas.DataFrame({"a": np.random.random(100),
"b": np.random.random(100) + 10})
# Bin the data frame by "a" with 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))
# Get the mean of b, binned by the values in a
print(groups.mean().b)
In this case, summing up groups.size() gave 100.
I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial to obtain the correct answer.