How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame
There's a handy ntile
function in package dplyr
. It's flexible in the sense that you can very easily define the number of *tiles or "bins" you want to create.
Load the package (install first if you haven't) and add the quartile column:
library(dplyr)
temp$quartile <- ntile(temp$value, 4)
Or, if you want to use dplyr syntax:
temp <- temp %>% mutate(quartile = ntile(value, 4))
Result in both cases is:
temp
# name value quartile
#1 a -0.56047565 1
#2 b -0.23017749 2
#3 c 1.55870831 4
#4 d 0.07050839 2
#5 e 0.12928774 3
#6 f 1.71506499 4
#7 g 0.46091621 3
#8 h -1.26506123 1
#9 i -0.68685285 1
#10 j -0.44566197 2
#11 k 1.22408180 4
#12 l 0.35981383 3
data:
Note that you don't need to create the "quartile" column in advance and use set.seed
to make the randomization reproducible:
set.seed(123)
temp <- data.frame(name=letters[1:12], value=rnorm(12))
The method I use is one of these or Hmisc::cut2(value, g=4)
:
temp$quartile <- with(temp, cut(value,
breaks=quantile(value, probs=seq(0,1, by=0.25), na.rm=TRUE),
include.lowest=TRUE))
An alternate might be:
temp$quartile <- with(temp, factor(
findInterval( val, c(-Inf,
quantile(val, probs=c(0.25, .5, .75)), Inf) , na.rm=TRUE),
labels=c("Q1","Q2","Q3","Q4")
))
The first one has the side-effect of labeling the quartiles with the values, which I consider a "good thing", but if it were not "good for you", or the valid problems raised in the comments were a concern you could go with version 2. You can use labels=
in cut
, or you could add this line to your code:
temp$quartile <- factor(temp$quartile, levels=c("1","2","3","4") )
Or even quicker but slightly more obscure in how it works, although it is no longer a factor, but rather a numeric vector:
temp$quartile <- as.numeric(temp$quartile)
I'll add the data.table
version for anyone else Googling it (i.e., @BondedDust's solution translated to data.table
and pared down a tad):
library(data.table)
setDT(temp)
temp[ , quartile := cut(value,
breaks = quantile(value, probs = 0:4/4),
labels = 1:4, right = FALSE)]
Which is much better (cleaner, faster) than what I had been doing:
temp[ , quartile :=
as.factor(ifelse(value < quantile(value, .25), 1,
ifelse(value < quantile(value, .5), 2,
ifelse(value < quantile(value, .75), 3, 4))]
Note, however, that this approach requires the quantiles to be distinct, e.g. it will fail on rep(0:1, c(100, 1))
; what to do in this case is open ended so I leave it up to you.
You can use the quantile()
function, but you need to handle rounding/precision when using cut()
. So
set.seed(123)
temp <- data.frame(name=letters[1:12], value=rnorm(12), quartile=rep(NA, 12))
brks <- with(temp, quantile(value, probs = c(0, 0.25, 0.5, 0.75, 1)))
temp <- within(temp, quartile <- cut(value, breaks = brks, labels = 1:4,
include.lowest = TRUE))
Giving:
> head(temp)
name value quartile
1 a -0.56047565 1
2 b -0.23017749 2
3 c 1.55870831 4
4 d 0.07050839 2
5 e 0.12928774 3
6 f 1.71506499 4
Adapting dplyr::ntile
to take advantage of data.table
optimizations provides a faster solution.
library(data.table)
setDT(temp)
temp[order(value) , quartile := floor( 1 + 4 * (.I-1) / .N)]
Probably doesn't qualify as cleaner, but it's faster and one-line.
Timing on bigger data set
Comparing this solution to ntile
and cut
for data.table
as proposed by @docendo_discimus and @MichaelChirico.
library(microbenchmark)
library(dplyr)
set.seed(123)
n <- 1e6
temp <- data.frame(name=sample(letters, size=n, replace=TRUE), value=rnorm(n))
setDT(temp)
microbenchmark(
"ntile" = temp[, quartile_ntile := ntile(value, 4)],
"cut" = temp[, quartile_cut := cut(value,
breaks = quantile(value, probs = seq(0, 1, by=1/4)),
labels = 1:4, right=FALSE)],
"dt_ntile" = temp[order(value), quartile_ntile_dt := floor( 1 + 4 * (.I-1)/.N)]
)
Gives:
Unit: milliseconds
expr min lq mean median uq max neval
ntile 608.1126 647.4994 670.3160 686.5103 691.4846 712.4267 100
cut 369.5391 373.3457 375.0913 374.3107 376.5512 385.8142 100
dt_ntile 117.5736 119.5802 124.5397 120.5043 124.5902 145.7894 100