Efficiently sum across multiple columns in R

I have the following condensed data set:

a<-as.data.frame(c(2000:2005))
a$Col1<-c(1:6)
a$Col2<-seq(2,12,2)

colnames(a)<-c("year","Col1","Col2")

for (i in 1:2){
  a[[paste("Var_", i, sep="")]]<-i*a[[paste("Col", i, sep="")]]
}

I would like to sum the columns Var1 and Var2, which I use:

a$sum<-a$Var_1 + a$Var_2

In reality my data set is much larger - I would like to sum from Var_1 to Var_n (n can be upto 20). There must be a more efficient way to do this than:

 a$sum<-a$Var_1 + ... + a$Var_n

Solution 1:

Here's a solution using the tidyverse. You can extend it to as many columns as you like using the select() function to select the appropriate columns within a mutate().

library(tidyverse)

a<-as.data.frame(c(2000:2005))
a$Col1<-c(1:6)
a$Col2<-seq(2,12,2)

colnames(a)<-c("year","Col1","Col2")

for (i in 1:2){
    a[[paste("Var_", i, sep="")]]<-i*a[[paste("Col", i, sep="")]]
}
a
#>   year Col1 Col2 Var_1 Var_2
#> 1 2000    1    2     1     4
#> 2 2001    2    4     2     8
#> 3 2002    3    6     3    12
#> 4 2003    4    8     4    16
#> 5 2004    5   10     5    20
#> 6 2005    6   12     6    24

# Tidyverse solution
a %>%
    mutate(Total = select(., Var_1:Var_2) %>% rowSums(na.rm = TRUE))
#>   year Col1 Col2 Var_1 Var_2 Total
#> 1 2000    1    2     1     4     5
#> 2 2001    2    4     2     8    10
#> 3 2002    3    6     3    12    15
#> 4 2003    4    8     4    16    20
#> 5 2004    5   10     5    20    25
#> 6 2005    6   12     6    24    30

Created on 2019-01-01 by the reprex package (v0.2.1)

Solution 2:

You can use colSums(a[,c("Var1", "Var2")]) or rowSums(a[,c("Var_1", "Var_2")]). In your case you want the latter.

Solution 3:

with dplyr you can use

a %>%
rowwise() %>%
mutate(sum = sum(Col1,Col1, na.rm = T))

or more efficiently

a %>%
rowwise() %>%
mutate(sum = sum(across(starts_with("Col")), na.rm = T))

Solution 4:

If you're working with a very large dataset, rowSums can be slow.

An alternative is the rowsums function from the Rfast package. This requires you to convert your data to a matrix in the process and use column indices rather than names. Here's an example based on your code:

## load Rfast
library(Rfast)

## create dataset
a <- as.data.frame(c(2000:2005))
a$Col1 <- c(1:6)
a$Col2 <- seq(2,12,2)

colnames(a) <- c("year","Col1","Col2")

for (i in 1:2){
  a[[paste("Var_", i, sep="")]] <- i*a[[paste("Col", i, sep="")]]
}

## get column indices based on names
col_st <- grep("Var_1", colnames(a))  # index of "Var_1" col
col_en <- grep("Var_2", colnames(a))  # index of "Var_2" col
cols   <- c(col_st:col_en)  # indices of all cols from "Var_1" to "Var_2"

## sum rows 4 to 5
a$Total <- rowsums(as.matrix(a[,cols]))