dplyr - using mutate() like rowMeans()

I can't find the answer anywhere.

I would like to calculate a new variable in a data frame based on the mean of each row.

For example:

data <- data.frame(id=c(101,102,103), a=c(1,2,3), b=c(2,2,2), c=c(3,3,3))

I want to use mutate to create a variable d that is the mean of a, b and c. I would like to be able to do that by selecting columns explicitly, as in d=mean(a,b,c), and I also need to be able to use a range of variables (as in dplyr), as in d=mean(a:c).

And of course

mutate(data, c=mean(a,b)) 

or

mutate(data, c=rowMeans(a,b)) 

don't work.

Can you give me some tips?

Regards


Solution 1:

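You're looking for rowwise(), which makes mutate() evaluate the expression once per row:

library(dplyr)

data %>% 
    rowwise() %>% 
    mutate(c=mean(c(a,b)))  # mean() averages a single vector, so combine the values with c()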

#      id     a     b     c
#   (dbl) (dbl) (dbl) (dbl)
# 1   101     1     2   1.5
# 2   102     2     2   2.0
# 3   103     3     2   2.5

or

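library(purrr)
# lift_vd(mean) takes its values as separate arguments instead of one vector.
# Note: the lift_*() helpers are deprecated as of purrr 1.0.0.
data %>% 
    rowwise() %>% 
    mutate(c=lift_vd(mean)(a,b))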
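
The question also asks for a column range such as a:c. A minimal sketch of two ways to do that, assuming dplyr 1.0.0 or later (needed for c_across() and across()); the names res_rowwise and res_vectorised are only illustrative:

```r
library(dplyr)

data <- data.frame(id = c(101, 102, 103), a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3))

# Row-wise: c_across() collects the selected columns of the current row into one vector
res_rowwise <- data %>%
    rowwise() %>%
    mutate(d = mean(c_across(a:c))) %>%
    ungroup()

# Vectorised: across() returns the selected columns as a data frame,
# which rowMeans() averages row by row without needing rowwise()
res_vectorised <- data %>%
    mutate(d = rowMeans(across(a:c)))
```

Both variants accept any tidyselect expression, so c(a, b) works in place of a:c. The rowMeans() form is usually the faster of the two on large data, since it avoids per-row evaluation.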

Solution 2:

dplyr is badly suited to this kind of operation because it assumes tidy data, and for the problem in question your data is untidy.

You can of course tidy it first:

tidy_data = tidyr::gather(data, name, value, -id)

Which looks like this:

   id name value
1 101    a     1
2 102    a     2
3 103    a     3
4 101    b     2
5 102    b     2
6 103    b     2
    …
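
In current tidyr, gather() is superseded by pivot_longer(). A sketch of the same reshaping, assuming tidyr 1.0.0 or later; note that the rows come out ordered by id rather than by name, which does not matter for the grouped steps that follow:

```r
library(tidyr)

data <- data.frame(id = c(101, 102, 103), a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3))

# Stack every column except id into name/value pairs (one row per id and column)
tidy_data <- pivot_longer(data, cols = -id, names_to = "name", values_to = "value")
```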

And then:

tidy_data %>% group_by(id) %>% summarize(mean = mean(value))
     id     mean
  (dbl)    (dbl)
1   101 2.000000
2   102 2.333333
3   103 2.666667

Of course this discards the original data. You could use mutate instead of summarize to avoid this. Finally, you can then un-tidy your data again:

tidy_data %>%
    group_by(id) %>%
    mutate(mean = mean(value)) %>%
    tidyr::spread(name, value)
     id     mean     a     b     c
  (dbl)    (dbl) (dbl) (dbl) (dbl)
1   101 2.000000     1     2     3
2   102 2.333333     2     2     3
3   103 2.666667     3     2     3

Alternatively, you could summarise and then merge the result with the original table:

tidy_data %>%
    group_by(id) %>%
    summarize(mean = mean(value)) %>%
    inner_join(data, by = 'id')

The result is the same in either case. I conceptually prefer the second variant.
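
To check that claim on the example data, here is a self-contained sketch that runs both variants side by side. The names via_spread and via_join are only illustrative, and it assumes a tidyr version that still exports gather() and spread():

```r
library(dplyr)
library(tidyr)

data <- data.frame(id = c(101, 102, 103), a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3))
tidy_data <- gather(data, name, value, -id)

# Variant 1: add the mean to every row, then spread back to wide format
via_spread <- tidy_data %>%
    group_by(id) %>%
    mutate(mean = mean(value)) %>%
    spread(name, value) %>%
    ungroup()

# Variant 2: summarise to one row per id, then join the original columns back on
via_join <- tidy_data %>%
    group_by(id) %>%
    summarize(mean = mean(value)) %>%
    inner_join(data, by = "id")

# Compare the two results as plain data frames
all.equal(as.data.frame(via_spread), as.data.frame(via_join))
```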