Practical limits of R data frame
I have been reading that read.table is not efficient for large data files, and also that R is not suited for large data sets. So I was wondering where I can find the practical limits and any performance charts for (1) reading in data of various sizes and (2) working with data of varying sizes.
In effect, I want to know when performance deteriorates and when I hit a roadblock. Any comparison against C++/MATLAB or other languages would also be really helpful. Finally, if there is any specific performance comparison for Rcpp and RInside, that would be great!
R is suited for large data sets, but you may have to change your way of working somewhat from what the introductory textbooks teach you. I did a post on Big Data for R which crunches a 30 GB data set and which you may find useful for inspiration.
The usual sources of information to get started are the High-Performance Computing Task View on CRAN and the R-SIG HPC mailing list.
The main limit you have to work around is a historic cap on the length of a vector at 2^31 - 1 elements, which would not be so bad if R did not also store matrices as vectors. (The limit is there for compatibility with some BLAS libraries.)
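A minimal sketch of where that number comes from and why it constrains matrices as well (just base R, nothing specific to any setup):

```r
## The historic limit is the largest value a 32-bit signed integer can hold.
.Machine$integer.max        # 2147483647, i.e. 2^31 - 1

## A matrix is just a vector with a 'dim' attribute, so the total number
## of cells (nrow * ncol) is what counts against that limit.
m <- matrix(1:6, nrow = 2, ncol = 3)
length(m)                   # 6: the length of the underlying vector
attributes(m)$dim           # c(2, 3)
```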
We regularly analyse telco call data records and marketing databases with millions of customers using R, so I would be happy to talk more if you are interested.
The physical limits arise from the use of 32-bit indexes on vectors. As a result, vectors of up to 2^31 - 1 elements are allowed. Matrices are vectors with dimensions, so the product of nrow(mat) and ncol(mat) must be within 2^31 - 1. Data frames and lists are general vectors, so each component can take 2^31 - 1 entries, which for data frames means you can have that many rows and that many columns. For lists you can have 2^31 - 1 components, each of 2^31 - 1 elements. This is drawn from a recent posting by Duncan Murdoch in reply to a question on R-Help.
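A rough illustration of how those limits map onto the different structures (the toy data frame here is only for demonstration):

```r
## A data frame is a list of equal-length column vectors, so the number of
## rows and the number of columns are each bounded by 2^31 - 1 separately.
df <- data.frame(a = 1:5, b = letters[1:5])
is.list(df)       # TRUE: each column is one component of the list
length(df)        # 2 columns
nrow(df)          # 5 rows

## For a matrix, by contrast, it is the product of the dimensions that is
## bounded: prod(dim(mat)) must stay within 2^31 - 1.
m <- matrix(0, nrow = 10, ncol = 10)
prod(dim(m))      # 100 cells in the underlying vector
```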
With standard R all of that has to fit in RAM, so that might be a more pressing limit, but the High-Performance Computing Task View that others have mentioned contains details of packages that can circumvent the in-memory issues.
1) The R Import / Export manual should be the first port of call for questions about importing data - there are many options and what will work for you could be very specific.
http://cran.r-project.org/doc/manuals/R-data.html
read.table specifically has greatly improved performance if you use the options provided to it, particularly colClasses, comment.char, and nrows - otherwise that information has to be inferred from the data itself, which can be costly.
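For example, a sketch of what supplying those options looks like in practice (the file name, separator, and column types below are invented for illustration):

```r
## Hypothetical example: read a large delimited file while telling
## read.table everything it would otherwise have to work out itself.
dat <- read.table(
  "big_file.txt",                  # assumed file name
  header       = TRUE,
  sep          = "\t",
  colClasses   = c("integer", "numeric", "character", "factor"),
  comment.char = "",               # "" turns off comment scanning entirely
  nrows        = 1e6               # an upper bound on the number of rows
)
```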
2) There is a specific limit for the length (total number of elements) for any vector, matrix, array, column in a data.frame, or list. This is due to a 32-bit index used under the hood, and is true for 32-bit and 64-bit R. The number is 2^31 - 1. This is the maximum number of rows for a data.frame, but it is so large you are far more likely to run out of memory for even single vectors before you start collecting several of them.
See help(Memory-limits) and help(Memory) for details.
A single vector of that length will take many gigabytes of memory (it depends on the type and storage mode of each vector - roughly 17.1 GB for numeric), so it is unlikely to be a practical limit unless you are really pushing things. If you really need to push past the available system memory (64-bit is mandatory here), then standard database techniques as discussed in the Import/Export manual, or memory-mapped file options (like the ff package), are worth considering. The CRAN High Performance Computing Task View is a good resource for this end of things.
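A quick back-of-the-envelope check of where a figure like that comes from (plain arithmetic, nothing package-specific):

```r
## A numeric (double) element occupies 8 bytes, so a vector of maximal
## length needs about 17.2 billion bytes, i.e. roughly 16 GiB.
(2^31 - 1) * 8               # 17179869176 bytes
(2^31 - 1) * 8 / 1024^3      # ~16 GiB

## The same arithmetic at a size you can actually allocate:
x <- numeric(1e6)
object.size(x)               # about 8,000,048 bytes: 8 bytes per element plus a small header
```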
Finally, if you have stacks of RAM (16 GB or more) and need 64-bit indexing, it might come in a future release of R. http://www.mail-archive.com/[email protected]/msg92035.html
Also, Ross Ihaka discusses some of the historical decisions and future directions for an R-like language in papers and talks here: http://www.stat.auckland.ac.nz/~ihaka/?Papers_and_Talks
I can only answer the question about read.table, since I don't have any experience with large data sets. read.table performs poorly if you don't provide a colClasses argument. Without it, colClasses defaults to NA and read.table tries to guess the class of every column, which can be slow, especially when there are a lot of columns.