dplyr::summarise with filtering inside
Inside dplyr::summarise, how can I filter on a different column than the one I'm summarising?
Example:
t = data.frame(
x = c(1,1,1,1,2,2,2,2,3,3, 3, 3),
y = c(1,2,3,4,5,6,7,8,9,10,11,12),
z = c(1,2,1,2,1,2,1,2,1,2, 1, 2)
)
t %>%
dplyr::group_by(x) %>%
dplyr::summarise(
mall = mean(y), # this should include all rows in each group
ma = mean(y), # this should only include rows where z == 1
mb = mean(y) # this should only include rows where z == 2
)
So the problem is to apply a summary function to one column while filtering on another, all within summarise.
One idea was double grouping, that is, applying group_by on both x and z, but I don't want every summary column to be based on the double grouping; some (like mall in the example above) should be based on the single grouping only.
One quick option is to use ifelse to keep the rows you need, set the rest to NA, and pass na.rm = TRUE so the missing values are ignored, as in the example below.
t %>%
  dplyr::group_by(x) %>%
  dplyr::summarise(
    mall = mean(y),                                    # all rows in each group
    ma = mean(ifelse(z == 1, y, NA), na.rm = TRUE),    # only rows where z == 1
    mb = mean(ifelse(z == 2, y, NA), na.rm = TRUE)     # only rows where z == 2
  )
# A tibble: 3 x 4
x mall ma mb
<dbl> <dbl> <dbl> <dbl>
1 1 2.5 2 3
2 2 6.5 6 7
3 3 10.5 10 11
While the answer by @Colin H is certainly the way to go for this specific example, a more flexible approach is to work within the subsets produced by the first grouping operation. This could be implemented with dplyr::group_split plus a subsequent purrr::map_dfr, but dplyr::group_modify does it in one step.
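For comparison, the two-step variant with group_split and map_dfr could look roughly like the sketch below. It assumes dplyr and purrr are installed; the names piece and res_split are made up for illustration.

```r
library(dplyr)
library(purrr)

t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  z = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

res_split <- t %>%
  group_split(x) %>%                 # list of tibbles, one per value of x
  map_dfr(function(piece) {          # piece still contains the column x
    piece %>%
      mutate(mall = mean(y)) %>%     # mean over the whole piece
      group_by(x, z, mall) %>%
      summarise(m = mean(y), .groups = "drop")
  })

res_split
```

Unlike with group_modify, the grouping column x stays inside each piece, so it has to be carried through the inner group_by by hand.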
Note this relevant sentence from the documentation of dplyr::group_modify:
Use group_modify() when summarize() is too limited, in terms of what you need to do and return for each group.
So here is a solution for the example provided above:
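library(dplyr)  # provides %>% and the verbs used below

t = data.frame(
  x = c(1,1,1,1,2,2,2,2,3,3, 3, 3),
  y = c(1,2,3,4,5,6,7,8,9,10,11,12),
  z = c(1,2,1,2,1,2,1,2,1,2, 1, 2)
)
t %>%
  dplyr::group_by(x) %>%
  # .x is the subset for one value of x, .y is the one-row key tibble
  dplyr::group_modify(function(.x, .y) {
    .x %>%
      # mean over all rows of the current group
      dplyr::mutate(mall = mean(y)) %>%
      # mean per value of z; mall is constant, so grouping by it just keeps the column
      dplyr::group_by(z, mall) %>%
      dplyr::summarise(
        m = mean(y),
        .groups = "drop"
      )
  }) %>%
  dplyr::ungroup()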
# A tibble: 6 x 4
x z mall m
<dbl> <dbl> <dbl> <dbl>
1 1 1 2.5 2
2 1 2 2.5 3
3 2 1 6.5 6
4 2 2 6.5 7
5 3 1 10.5 10
6 3 2 10.5 11
group_modify applies a function to each subset tibble after grouping by x. This function has two arguments:
The subset of the data for the group, exposed as .x.
The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
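To see what these two arguments contain, here is a small sketch that only reports on them. The output column names n_rows, subset_cols and key_value are made up for illustration.

```r
library(dplyr)

t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  z = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

info <- t %>%
  group_by(x) %>%
  group_modify(function(.x, .y) {
    tibble(
      n_rows      = nrow(.x),                       # rows in the current subset
      subset_cols = paste(names(.x), collapse = ", "),  # the grouping column is not part of .x
      key_value   = .y$x                            # the key holds the group's value of x
    )
  }) %>%
  ungroup()

info
```

Each group contributes one row here; the subset has only the columns y and z, because group_modify adds the grouping column x back itself.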
Within our function we first use mutate to cover the requested mall case. No further grouping is needed for that, because the wrapping group_modify already restricts the data to one group of x. Then we apply another group_by + summarise to cover the different values of z. Note that this solution is independent of the number of distinct values in z. While the two cases in this example can easily be handled manually, that becomes tedious if there are more.
If the wide output format with individual columns for the cases in z is required, you can reshape the output of the code above with tidyr::pivot_wider.
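A minimal sketch of that reshaping step, assuming tidyr is installed. The prefix "m_z" and the names long and wide are arbitrary choices for illustration.

```r
library(dplyr)
library(tidyr)

t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  z = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

# Long result: one row per combination of x and z
long <- t %>%
  group_by(x) %>%
  group_modify(function(.x, .y) {
    .x %>%
      mutate(mall = mean(y)) %>%
      group_by(z, mall) %>%
      summarise(m = mean(y), .groups = "drop")
  }) %>%
  ungroup()

# Wide result: one row per x, one column per value of z
wide <- long %>%
  pivot_wider(
    names_from   = z,      # values of z become column names
    values_from  = m,      # cell contents come from m
    names_prefix = "m_z"   # arbitrary prefix, gives m_z1 and m_z2
  )

wide
```

The columns x and mall identify the rows, so the result has one row per value of x, matching the layout of the summarise-based answers.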
Another option, perhaps a little more concise, is subsetting inside summarise:
t %>%
group_by(x) %>%
summarise(mall = mean(y),
ma = mean(y[z == 1]),
mb = mean(y[z == 2]))
# A tibble: 3 x 4
x mall ma mb
* <dbl> <dbl> <dbl> <dbl>
1 1 2.5 2 3
2 2 6.5 6 7
3 3 10.5 10 11
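One thing to keep in mind with subsetting: if a group has no rows matching the condition, the subset is empty and mean() returns NaN. The sketch below uses made-up data (d) in which group x = 2 has no rows with z == 2, and adds a count column (n_b, a hypothetical name) to make such cases visible.

```r
library(dplyr)

# Hypothetical data: group x = 2 has no rows with z == 2
d <- data.frame(
  x = c(1, 1, 2, 2),
  y = c(1, 2, 3, 4),
  z = c(1, 2, 1, 1)
)

res <- d %>%
  group_by(x) %>%
  summarise(
    mall = mean(y),
    ma   = mean(y[z == 1]),
    mb   = mean(y[z == 2]),  # NaN for x = 2: mean() of a zero-length vector
    n_b  = sum(z == 2)       # number of rows that went into mb
  )

res
```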