data.table row-wise sum, mean, min, max like dplyr?
Solution 1:
You can use an efficient row-wise functions from matrixStats
package.
library(matrixStats)
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
MAX = rowMaxs(as.matrix(.SD), na.rm=T),
AVG = rowMeans(.SD, na.rm=T),
SUM = rowSums(.SD, na.rm=T)), .SDcols=c(Q1, Q2,Q3,Q4)]
dt
# ProductName Country Q1 Q2 Q3 Q4 MIN MAX AVG SUM
# 1: Lettuce CA NA 22 51 79 22 79 50.66667 152
# 2: Beetroot FR 61 8 NA 10 8 61 26.33333 79
# 3: Spinach FR 40 NA 79 49 40 79 56.00000 168
# 4: Kale CA 54 5 16 NA 5 54 25.00000 75
# 5: Carrot CA NA NA NA NA Inf -Inf NaN 0
For dataset with 500000 rows(using the data.table
from CRAN)
dt <- rbindlist(lapply(1:100000, function(i)dt))
system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
MAX = rowMaxs(as.matrix(.SD), na.rm=T),
AVG = rowMeans(.SD, na.rm=T),
SUM = rowSums(.SD, na.rm=T)), .SDcols=c("Q1", "Q2","Q3","Q4")])
# user system elapsed
# 0.089 0.004 0.093
rowwise
(or by=1:nrow(dt)
) is "euphemism" for for loop
, as exemplified by
library(dplyr) ; library(magrittr)
system.time(dt %>% rowwise() %>%
transmute(ProductName, Country, Q1, Q2, Q3, Q4,
MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
# user system elapsed
# 80.832 0.111 80.974
system.time(dt[, `:=`(AVG= mean(as.numeric(.SD),na.rm=TRUE),MIN = min(.SD, na.rm=TRUE),MAX = max(.SD, na.rm=TRUE),SUM = sum(.SD, na.rm=TRUE)),.SDcols=c("Q1", "Q2","Q3","Q4"),by=1:nrow(dt)] )
# user system elapsed
# 141.492 0.196 141.757
Solution 2:
With by=1:nrow(dt)
, performs the rowwise operation in data.table
library(data.table)
dt[, `:=`(AVG= mean(as.numeric(.SD),na.rm=TRUE),MIN = min(.SD, na.rm=TRUE),MAX = max(.SD, na.rm=TRUE),SUM = sum(.SD, na.rm=TRUE)),.SDcols=c(Q1, Q2,Q3,Q4),by=1:nrow(dt)]
ProductName Country Q1 Q2 Q3 Q4 AVG MIN MAX SUM
1: Lettuce CA NA 22 51 79 50.66667 22 79 152
2: Beetroot FR 61 8 NA 10 26.33333 8 61 79
3: Spinach FR 40 NA 79 49 56.00000 40 79 168
4: Kale CA 54 5 16 NA 25.00000 5 54 75
5: Carrot CA NA NA NA NA NaN Inf -Inf 0
Warning messages:
1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :
no non-missing arguments to min; returning Inf
2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
You got warning messages, because in row 5, you are computing max, sum, min, and max of nothing. For example, see below:
min(c(NA,NA,NA,NA),na.rm=TRUE)
[1] Inf
Warning message:
In min(c(NA, NA, NA, NA), na.rm = TRUE) :
no non-missing arguments to min; returning Inf
Solution 3:
Just another way (not that efficient though, as na.omit()
is called each time, and many memory allocations as well):
require(data.table)
new_cols = c("MIN", "MAX", "SUM", "AVG")
dt[, (new_cols) := Map(function(x, f) f(x),
list(na.omit(c(Q1,Q2,Q3,Q4))),
list(min, max, sum, mean)),
by = 1:nrow(dt)]
# ProductName Country Q1 Q2 Q3 Q4 MIN MAX SUM AVG
# 1: Lettuce CA NA 22 51 79 22 79 152 50.66667
# 2: Beetroot FR 61 8 NA 10 8 61 79 26.33333
# 3: Spinach FR 40 NA 79 49 40 79 168 56.00000
# 4: Kale CA 54 5 16 NA 5 54 75 25.00000
# 5: Carrot CA NA NA NA NA Inf -Inf 0 NaN
But as I mentioned, this'll get much simpler once colwise()
and rowwise()
are implemented. The syntax in this case could look something like:
dt[, rowwise(.SD, list(MIN=min, MAX=max, SUM=sum, AVG=mean), na.rm=TRUE), by = 1:nrow(dt)]
# `by = ` is really not necessary in this case.
or even more straightforward for this case:
rowwise(dt, list(...), na.rm=TRUE)
Edit:
Another variation:
myNACount <- function(x, ...) length(attributes(x)$na.action)
foo <- function(x, ...) {
funs = c(min, max, mean, sum, myNACount)
lapply(funs, function(f) f(x, ...))
}
dt[, (new_cols) := foo(na.omit(c(Q1, Q2, Q3, Q4)), na.rm=TRUE), by=1:nrow(dt)]
# ProductName Country Q1 Q2 Q3 Q4 MIN MAX SUM AVG NAs
# 1: Lettuce CA NA 22 51 79 22 79 50.66667 152 1
# 2: Beetroot FR 61 8 NA 10 8 61 26.33333 79 1
# 3: Spinach FR 40 NA NA 49 40 49 44.50000 89 2
# 4: Kale CA 54 5 16 NA 5 54 25.00000 75 1
# 5: Carrot CA NA NA NA NA Inf -Inf NaN 0 4
Solution 4:
The apply
function can be used to perform row-wise calculations. Defining the function separately keeps things cleaner:
dstats <- function(x){
c(mean(x,na.rm=TRUE),
min(x, na.rm=TRUE),
max(x, na.rm=TRUE),
sum(x, na.rm=TRUE))
}
The function can now be applied over the rows of the data.table.
(dt[,
c("AVG", "MIN", "MAX", "SUM") := data.frame(t(apply(.SD, 1, dstats))),
.SDcols=c("Q1", "Q2","Q3","Q4"),
])
Notice that the only advantage of doing this with [.data.table
is that it allows the use of :=
for fast adding by reference.
This is slower but more flexible than the matrixStats
solution, and faster than the dplyr
solution by @ExperimenteR, clocking in at 36 seconds (my timings for the other methods were similar to those in @ExperimenteR's answer).