Sum across multiple columns with dplyr

r dplyr

My question involves summing up values across multiple columns of a data frame and creating a new column corresponding to this summation using dplyr. The data entries in the columns are binary(0,1). I am thinking of a row-wise analog of the summarise_each or mutate_each function of dplyr. Below is a minimal example of the data frame:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))

> df
   x1 x2 x3 x4 x5
1   1  1  0  1  1
2   0  1  1  0  1
3   0 NA  0 NA NA
4  NA  1  1  1  1
5   0  1  1  0  1
6   1  0  0  0  1
7   1 NA NA NA NA
8  NA NA NA  0  1
9   0  0  0  0  0
10  1  1  1  1  1

I could use something like:

df <- df %>% mutate(sumrow= x1 + x2 + x3 + x4 + x5)

but this would involve writing out the names of each of the columns. I have like 50 columns. In addition, the column names change at different iterations of the loop in which I want to implement this operation so I would like to try avoid having to give any column names.

How can I do that most efficiently? Any assistance would be greatly appreciated.

dplyr >= 1.0.0 using across

sum up each row using rowSums (rowwise works for any aggreation, but is slower)

df %>%
   replace(is.na(.), 0) %>%
   mutate(sum = rowSums(across(where(is.numeric))))

sum down each column

df %>%
   summarise(across(everything(), ~ sum(., is.na(.), 0)))

dplyr < 1.0.0

sum up each row

df %>%
   replace(is.na(.), 0) %>%
   mutate(sum = rowSums(.[1:5]))

sum down each column using superseeded summarise_all:

df %>%
   replace(is.na(.), 0) %>%
   summarise_all(funs(sum))

If you want to sum certain columns only, I'd use something like this:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>% select(x3:x5) %>% rowSums(na.rm=TRUE) -> df$x3x5.total
head(df)

This way you can use dplyr::select's syntax.

Sum across multiple columns with dplyr

dplyr >= 1.0.0 using across

dplyr < 1.0.0

Related

Recent Posts