With `pandas.cut()`, how do I get integer bins and avoid getting a negative lowest bound?
You should explicitly set the `labels` argument.
Preparations:

    import numpy as np

    lower, higher = df['value'].min(), df['value'].max()
    n_bins = 7

Build up the labels:

    # range() does not accept a float step, so use np.linspace
    # to get n_bins + 1 evenly spaced edges (8 edges for 7 bins)
    edges = np.linspace(lower, higher, n_bins + 1)
    lbs = ['(%d, %d]' % (edges[i], edges[i+1]) for i in range(len(edges)-1)]

Set the labels:

    df['binned_df_pd'] = pd.cut(df.value, bins=n_bins, labels=lbs, include_lowest=True)
None of the other answers (including OP's `np.histogram` workaround) seem to work anymore. They have upvotes, so I'm not sure if something has changed over the years.

`IntervalIndex` requires all intervals to be closed identically, so `[0, 53]` cannot coexist with `(322, 376]`.
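To make the closure constraint concrete, here is a minimal sketch. The breakpoints and the variable names (`same_side`, `mixed`, `error_message`) are my own, chosen only for illustration. I would expect the mixed case to raise a `ValueError` on the pandas versions I know of, but the exact message may differ between releases.

```python
import pandas as pd

# Every interval in an IntervalIndex shares one `closed` setting.
same_side = pd.IntervalIndex.from_breaks([0, 53, 107], closed="right")
print(same_side.closed)

# Mixing a fully closed interval with a right-closed one is rejected.
mixed = [
    pd.Interval(0, 53, closed="both"),      # [0, 53]
    pd.Interval(322, 376, closed="right"),  # (322, 376]
]
try:
    pd.IntervalIndex(mixed)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

This is why the approaches here only change the labels instead of trying to build a mixed-closure index.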
Here are two working solutions based on the relabeling approach:
- Without numpy, reuse `pd.cut` edges as `pd.cut` labels:

      bins = 7
      _, edges = pd.cut(df.value, bins=bins, retbins=True)
      labels = [f'({abs(edges[i]):.0f}, {edges[i+1]:.0f}]' for i in range(bins)]
      df['bin'] = pd.cut(df.value, bins=bins, labels=labels)

      #     value         bin
      # 1       8     (0, 53]
      # 2      16     (0, 53]
      # ..    ...         ...
      # 45    360  (322, 376]
      # 46    368  (322, 376]
- With numpy, convert `np.linspace` edges into `pd.cut` labels:

      bins = 7
      edges = np.linspace(df.value.min(), df.value.max(), bins+1).astype(int)
      labels = [f'({edges[i]}, {edges[i+1]}]' for i in range(bins)]
      df['bin'] = pd.cut(df.value, bins=bins, labels=labels)

      #     value         bin
      # 1       8     (0, 53]
      # 2      16     (0, 53]
      # ..    ...         ...
      # 45    360  (322, 376]
      # 46    368  (322, 376]
Note: Only the labels are changed, so the underlying binning will still occur with 0.1% margins.
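The margin can be checked directly with `retbins=True`. This is a rough sketch on made-up data (multiples of 8 from 0 to 376, which is my assumption and not necessarily OP's frame). It relies on the documented behavior that `pd.cut` with an integer `bins` extends the range by 0.1%, and on my understanding that with the default `right=True` the adjustment lands on the lowest edge.

```python
import pandas as pd

# Hypothetical data, assumed for illustration only
df = pd.DataFrame({"value": range(0, 384, 8)})

bins = 7
binned, edges = pd.cut(df["value"], bins=bins, retbins=True)

lo, hi = df["value"].min(), df["value"].max()
margin = 0.001 * (hi - lo)  # 0.1% of the data range

# The lowest edge sits slightly below the minimum value
print(edges[0], lo - margin)
print(binned.cat.categories[0])

# Relabeling changes the category names, not the bin assignment
labels = [f"({abs(edges[i]):.0f}, {edges[i + 1]:.0f}]" for i in range(bins)]
relabeled = pd.cut(df["value"], bins=bins, labels=labels)
same_assignment = bool((binned.cat.codes == relabeled.cat.codes).all())
print(same_assignment)
```

So the integer labels are purely cosmetic: each row ends up in the same bin as it would with the default interval labels.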
`pointplot()` output (as of pandas 1.2.4):

    sns.pointplot(x='bin', y='value', data=df)
    plt.xticks(rotation=30, ha='right')