Using dplyr window functions to calculate percentiles
In dplyr 1.0
, summarise
can return multiple values, allowing the following:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(quantile = scales::percent(c(0.25, 0.5, 0.75)),
mpg = quantile(mpg, c(0.25, 0.5, 0.75)))
Or, you can avoid a separate line to name the quantiles by going with enframe
:
mtcars %>%
group_by(cyl) %>%
summarise(enframe(quantile(mpg, c(0.25, 0.5, 0.75)), "quantile", "mpg"))
cyl quantile mpg <dbl> <chr> <dbl> 1 4 25% 22.8 2 4 50% 26 3 4 75% 30.4 4 6 25% 18.6 5 6 50% 19.7 6 6 75% 21 7 8 25% 14.4 8 8 50% 15.2 9 8 75% 16.2
Answer for previous versions of dplyr
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(x=list(enframe(quantile(mpg, probs=c(0.25,0.5,0.75)), "quantiles", "mpg"))) %>%
unnest(x)
cyl quantiles mpg 1 4 25% 22.80 2 4 50% 26.00 3 4 75% 30.40 4 6 25% 18.65 5 6 50% 19.70 6 6 75% 21.00 7 8 25% 14.40 8 8 50% 15.20 9 8 75% 16.25
This can be turned into a more general function using tidyeval:
q_by_group = function(data, value.col, ..., probs=seq(0,1,0.25)) {
groups=enquos(...)
data %>%
group_by(!!!groups) %>%
summarise(x = list(enframe(quantile({{value.col}}, probs=probs), "quantiles", "mpg"))) %>%
unnest(x)
}
q_by_group(mtcars, mpg)
q_by_group(mtcars, mpg, cyl)
q_by_group(mtcars, mpg, cyl, vs, probs=c(0.5,0.75))
q_by_group(iris, Petal.Width, Species)
If you're up for using purrr::map
, you can do it like this!
library(tidyverse)
mtcars %>%
tbl_df() %>%
nest(-cyl) %>%
mutate(Quantiles = map(data, ~ quantile(.$mpg)),
Quantiles = map(Quantiles, ~ bind_rows(.) %>% gather())) %>%
unnest(Quantiles)
#> # A tibble: 15 x 3
#> cyl key value
#> <dbl> <chr> <dbl>
#> 1 6 0% 17.8
#> 2 6 25% 18.6
#> 3 6 50% 19.7
#> 4 6 75% 21
#> 5 6 100% 21.4
#> 6 4 0% 21.4
#> 7 4 25% 22.8
#> 8 4 50% 26
#> 9 4 75% 30.4
#> 10 4 100% 33.9
#> 11 8 0% 10.4
#> 12 8 25% 14.4
#> 13 8 50% 15.2
#> 14 8 75% 16.2
#> 15 8 100% 19.2
Created on 2018-11-10 by the reprex package (v0.2.1)
One nice thing about this approach is the output is tidy, one observation per row.
This is a dplyr
approach that uses the tidy()
function of the broom
package, unfortunately it still requires do()
, but it is a lot simpler.
library(dplyr)
library(broom)
mtcars %>%
group_by(cyl) %>%
do( tidy(t(quantile(.$mpg))) )
which gives:
cyl X0. X25. X50. X75. X100.
(dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 4 21.4 22.80 26.0 30.40 33.9
2 6 17.8 18.65 19.7 21.00 21.4
3 8 10.4 14.40 15.2 16.25 19.2
Note the use of t()
since the broom
package does not have a method for named numerics.
This is based on my earlier answer for summary() here.
Not sure how to avoid do()
in dplyr
, but you can do this with c()
and as.list()
with data.table
in a pretty straightforward manner:
require(data.table)
as.data.table(mtcars)[, c(as.list(quantile(mpg, probs=p)),
avg=mean(mpg), n=.N), by=cyl]
# cyl 25% 50% 75% avg n
# 1: 6 18.65 19.7 21.00 19.74286 7
# 2: 4 22.80 26.0 30.40 26.66364 11
# 3: 8 14.40 15.2 16.25 15.10000 14
Replace by
with keyby
if you want them ordered by cyl
column.