Speeding up the performance of write.table
I have a data.frame and I want to write it out. The dimensions of my data.frame are 256 rows by 65536 columns. What are faster alternatives to write.csv?
data.table::fwrite() was contributed by Otto Seiskari and is available in versions 1.9.8+. Matt has made additional enhancements on top (including parallelisation) and wrote an article about it. Please report any issues on the tracker.
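Basic usage mirrors write.csv; a minimal sketch (the toy data and file name below are just placeholders):

library(data.table)
DT <- data.table(a = 1:5, b = letters[1:5])  # any data.frame works as well
fwrite(DT, "out.csv")                        # comma-separated, header row, no row names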
First, here's a comparison on the same dimensions used by @chase above (i.e., a very large number of columns: 65,000 columns (!) x 256 rows), together with fwrite and write_feather, so that we have some consistency across machines. Note the huge difference compress=FALSE makes in base R.
# -----------------------------------------------------------------------------
# function | object type | output type | compress= | Runtime | File size |
# -----------------------------------------------------------------------------
# save | matrix | binary | FALSE | 0.3s | 134MB |
# save | data.frame | binary | FALSE | 0.4s | 135MB |
# feather | data.frame | binary | FALSE | 0.4s | 139MB |
# fwrite | data.table | csv | FALSE | 1.0s | 302MB |
# save | matrix | binary | TRUE | 17.9s | 89MB |
# save | data.frame | binary | TRUE | 18.1s | 89MB |
# write.csv | matrix | csv | FALSE | 21.7s | 302MB |
# write.csv | data.frame | csv | FALSE | 121.3s | 302MB |
Note that fwrite() runs in parallel. The timing shown here is on a 13" MacBook Pro with 2 cores and 1 thread/core (plus 2 virtual threads via hyperthreading), a 512GB SSD, 256KB/core L2 cache and 4MB L4 cache. Depending on your system spec, YMMV.
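The exact script behind this comparison isn't reproduced here, but a rough sketch of how such timings could be generated looks like the following (object names are mine; getDTthreads() reports how many threads fwrite will use and setDTthreads() changes it):

library(data.table)
library(feather)
set.seed(21)
m  <- matrix(runif(256 * 65536), nrow = 256)    # 256 rows x 65,536 numeric columns
df <- as.data.frame(m)
dt <- as.data.table(df)
getDTthreads()                                   # threads available to fwrite
system.time(save(m,  file = "m.RData",  compress = FALSE))
system.time(save(df, file = "df.RData", compress = FALSE))
system.time(write_feather(df, "df.feather"))
system.time(fwrite(dt, "dt.csv"))
system.time(save(m,  file = "m_compressed.RData", compress = TRUE))
system.time(write.csv(m,  "m.csv"))
system.time(write.csv(df, "df.csv"))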
I also reran the benchmarks on more realistic (and bigger) data:
library(data.table)
NN <- 5e6 # at this number of rows, the .csv output is ~800Mb on my machine
set.seed(51423)
DT <- data.table(
str1 = sample(sprintf("%010d",1:NN)), #ID field 1
str2 = sample(sprintf("%09d",1:NN)), #ID field 2
# varying length string field--think names/addresses, etc.
str3 = replicate(NN,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
# factor-like string field with 50 "levels"
str4 = sprintf("%05d",sample(sample(1e5,50),NN,T)),
# factor-like string field with 17 levels, varying length
str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T),
collapse="")),NN,T),
# lognormally distributed numeric
num1 = round(exp(rnorm(NN,mean=6.5,sd=1.5)),2),
# 3 binary strings
str6 = sample(c("Y","N"),NN,T),
str7 = sample(c("M","F"),NN,T),
str8 = sample(c("B","W"),NN,T),
# right-skewed (integer type)
int1 = as.integer(ceiling(rexp(NN))),
num2 = round(exp(rnorm(NN,mean=6,sd=1.5)),2),
# lognormal numeric that can be positive or negative
num3 = (-1)^sample(2,NN,T)*round(exp(rnorm(NN,mean=6,sd=1.5)),2))
# -------------------------------------------------------------------------------
# function | object | out | other args | Runtime | File size |
# -------------------------------------------------------------------------------
# fwrite | data.table | csv | quote = FALSE | 1.7s | 523.2MB |
# fwrite | data.frame | csv | quote = FALSE | 1.7s | 523.2MB |
# feather | data.frame | bin | no compression | 3.3s | 635.3MB |
# save | data.frame | bin | compress = FALSE | 12.0s | 795.3MB |
# write.csv | data.frame | csv | row.names = FALSE | 28.7s | 493.7MB |
# save | data.frame | bin | compress = TRUE | 48.1s | 190.3MB |
# -------------------------------------------------------------------------------
So fwrite is ~2x faster than feather in this test. This was run on the same machine as noted above, with fwrite running in parallel on 2 cores. feather seems to be quite a fast binary format as well, but it offers no compression yet.
Here's an attempt at showing how fwrite compares with respect to scale:

NB: the benchmark has been updated by running base R's save() with compress = FALSE (since feather also is not compressed).
So fwrite is the fastest of all of them on this data (running on 2 cores), plus it creates a .csv which can easily be viewed, inspected and passed to grep, sed etc.
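For example, because the output is plain text, you can pre-filter it with command-line tools and read only the matching rows back into R; a small sketch assuming a recent data.table (the grep pattern is illustrative and fwrite_turbo.csv is the file written by the script below):

library(data.table)
hdr <- names(fread("fwrite_turbo.csv", nrows = 0))        # read the header only
subset_dt <- fread(cmd = "grep ',Y,' fwrite_turbo.csv",   # let grep pre-filter the rows
                   header = FALSE, col.names = hdr)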
Code for reproduction:
require(data.table)
require(microbenchmark)
require(feather)
ns <- as.integer(10^seq(2, 6, length.out = 25))
DTn <- function(nn)
data.table(
str1 = sample(sprintf("%010d",1:nn)),
str2 = sample(sprintf("%09d",1:nn)),
str3 = replicate(nn,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
str4 = sprintf("%05d",sample(sample(1e5,50),nn,T)),
str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),nn,T),
num1 = round(exp(rnorm(nn,mean=6.5,sd=1.5)),2),
str6 = sample(c("Y","N"),nn,T),
str7 = sample(c("M","F"),nn,T),
str8 = sample(c("B","W"),nn,T),
int1 = as.integer(ceiling(rexp(nn))),
num2 = round(exp(rnorm(nn,mean=6,sd=1.5)),2),
num3 = (-1)^sample(2,nn,T)*round(exp(rnorm(nn,mean=6,sd=1.5)),2))
count <- data.table(n = ns,
c = c(rep(1000, 12),
rep(100, 6),
rep(10, 7)))
mbs <- lapply(ns, function(nn){
print(nn)
set.seed(51423)
DT <- DTn(nn)
microbenchmark(times = count[n==nn,c],
write.csv=write.csv(DT, "writecsv.csv", quote=FALSE, row.names=FALSE),
save=save(DT, file = "save.RData", compress=FALSE),
fwrite=fwrite(DT, "fwrite_turbo.csv", quote=FALSE, sep=","),
feather=write_feather(DT, "feather.feather"))})
png("microbenchmark.png", height=600, width=600)
par(las=2, oma = c(1, 0, 0, 0))
matplot(ns, t(sapply(mbs, function(x) {
y <- summary(x)[,"median"]
y/y[3]})),
main = "Relative Speed of fwrite (turbo) vs. rest",
xlab = "", ylab = "Time Relative to fwrite (turbo)",
type = "l", lty = 1, lwd = 2,
col = c("red", "blue", "black", "magenta"), xaxt = "n",
ylim=c(0,25), xlim=c(0, max(ns)))
axis(1, at = ns, labels = prettyNum(ns, ","))
mtext("# Rows", side = 1, las = 1, line = 5)
legend("right", lty = 1, lwd = 3,
legend = c("write.csv", "save", "feather"),
col = c("red", "blue", "magenta"))
dev.off()
If all of your columns are of the same class, converting to a matrix before writing out provides a nearly 6x speed-up. Also, you can look into using write.matrix() from the MASS package, though it did not prove faster for this example. Maybe I didn't set something up properly:
#Fake data
m <- matrix(runif(256*65536), nrow = 256)
# As a data.frame
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#----------
# user system elapsed
# 319.53 13.65 333.76
#As a matrix
system.time(write.csv(m, "matrix.csv"))
#----------
# user system elapsed
# 52.43 0.88 53.59
#Using write.matrix()
require(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#----------
# user system elapsed
# 113.58 59.12 172.75
EDIT

To address the concern raised below that the results above are not fair to data.frame, here are some more results and timings to show that the overall message is still "convert your data object to a matrix if possible. If not possible, deal with it. Alternatively, reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance":
#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))
#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
# user system elapsed
# 317.85 13.95 332.44
#This even includes the overhead in converting to as.matrix in the timing
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
# user system elapsed
# 53.67 0.92 54.67
So, nothing really changes. To confirm this is reasonable, consider the relative time cost of as.data.frame():
m3 <- as.matrix(m2)
system.time(as.data.frame(m3))
# user system elapsed
# 0.77 0.00 0.77
So, not really a big deal, and not skewing the information as much as the comment below suggests. If you're still not convinced that using write.csv() on large data.frames is a bad idea performance-wise, consult the manual under the Note:
write.table can be slow for data frames with large numbers (hundreds or more) of
columns: this is inevitable as each column could be of a different class and so must be
handled separately. If they are all of the same class, consider using a matrix instead.
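That advice is easy to apply mechanically; here is a small sketch (the helper name is mine) that converts to a matrix only when every column really does share a single class:

write_csv_fast <- function(df, file) {
  # write.csv on a matrix avoids per-column handling, so convert when it's safe
  if (length(unique(vapply(df, function(x) class(x)[1], character(1)))) == 1L) {
    write.csv(as.matrix(df), file)
  } else {
    write.csv(df, file)
  }
}
write_csv_fast(m2, "dataframe_or_matrix.csv")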
Finally, consider moving to a native RData object if you're still losing sleep over saving things faster:
system.time(save(m2, file = "thisisfast.RData"))
# user system elapsed
# 21.67 0.12 21.81
Another option is to use the feather file format.
df <- as.data.frame(matrix(runif(256*65536), nrow = 256))
system.time(feather::write_feather(df, "df.feather"))
#> user system elapsed
#> 0.237 0.355 0.617
Feather is a binary file format designed to be very efficient to read and write. It's designed to work with multiple languages: there are currently R and Python clients, and a Julia client is in the works.
For comparison, here's how long saveRDS takes:
system.time(saveRDS(df, "df.rds"))
#> user system elapsed
#> 17.363 0.307 17.856
Now, this is a somewhat unfair comparison because the default for saveRDS is to compress the data, and here the data is incompressible because it's completely random. Turning compression off makes saveRDS significantly faster:
system.time(saveRDS(df, "df.rds", compress = FALSE))
#> user system elapsed
#> 0.181 0.247 0.473
And indeed it's now slightly faster than feather. So why use feather? Well, read_feather() is typically faster than readRDS(), and you usually write the data relatively few times compared to the number of times that you read it.
system.time(readRDS("df.rds"))
#> user system elapsed
#> 0.198 0.090 0.287
system.time(feather::read_feather("df.feather"))
#> user system elapsed
#> 0.125 0.060 0.185
The fst package

A more recent option for very fast reading and writing of data files is the fst package. fst generates files in a binary format.
Use write.fst(dat, "file.fst", compress = 0), where compress can go from 0 (no compression) to 100 (maximum compression). Data can be read back into R with dat = read.fst("file.fst"). Based on the timings listed at the package website, it's faster than feather, data.table and base R's readRDS and saveRDS.
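A minimal sketch of the round trip (the data and file names are placeholders; timings will vary by machine):

# install.packages("fst")
library(fst)
dat <- as.data.frame(matrix(runif(256 * 65536), nrow = 256))
system.time(write.fst(dat, "file.fst", compress = 0))  # 0 = no compression, 100 = maximum
dat2 <- read.fst("file.fst")                           # read back as a data.frame
identical(dim(dat), dim(dat2))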
The package development site warns that the fst data format is still evolving and that fst should therefore not yet be used for long-term data storage.