Tricks to manage the available memory in an R session
Ensure you record your work in a reproducible script. From time to time, reopen R and source() your script. You'll clean out anything you're no longer using and, as an added benefit, will have tested your code.
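A minimal sketch of that habit, assuming the analysis lives in a script called analysis.R (the file name is just a placeholder):

    # After restarting R with an empty workspace, rebuild it from the script;
    # objects the script no longer creates simply never reappear.
    source("analysis.R")   # "analysis.R" is a placeholder for your own script

    # quick check of what came back and how much memory is in use
    ls()
    gc()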
I use the data.table package. With its := operator you can (as sketched below):

- Add columns by reference
- Modify subsets of existing columns by reference, and by group by reference
- Delete columns by reference

None of these operations copy the (potentially large) data.table at all, not even once. Aggregation is also particularly fast because data.table uses much less working memory.
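A small sketch of those by-reference operations, using a toy table (DT, grp, x, y and grp.mean are made-up names for illustration):

    library(data.table)

    DT <- data.table(grp = rep(c("a", "b"), each = 5), x = rnorm(10))

    DT[, y := x * 2]                     # add a column by reference
    DT[x < 0, y := 0]                    # modify a subset of an existing column by reference
    DT[, grp.mean := mean(x), by = grp]  # assign a grouped result by reference, by group
    DT[, y := NULL]                      # delete a column by reference
    # none of these assignments copies DT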
Related links:

- News from data.table, London R presentation, 2012
- When should I use the := operator in data.table?
Saw this on Twitter and think it's an awesome function by Dirk! Following on from JD Long's answer, I would do this for user-friendly reading:
# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                         decreasing = FALSE, head = FALSE, n = 5) {
    # helper: apply fn to each named object in the environment at position `pos`
    napply <- function(names, fn) sapply(names, function(x)
        fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    # human-readable size (e.g. "771.6 Kb") alongside the raw byte count
    obj.prettysize <- napply(names, function(x) {
        format(utils::object.size(x), units = "auto") })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
        as.numeric(dim(x))[1:2]))
    # for objects without dim() (vectors etc., but not functions), report length() instead
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Length/Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing = decreasing), ]
    if (head)
        out <- head(out, n)
    out
}

# shorthand: the ten largest objects, biggest first
lsos <- function(..., n = 10) {
    .ls.objects(..., order.by = "Size", decreasing = TRUE, head = TRUE, n = n)
}

lsos()
Which results in something like the following:
                       Type   Size PrettySize Length/Rows Columns
pca.res                 PCA 790128   771.6 Kb           7      NA
DF               data.frame 271040   264.7 Kb         669      50
factor.AgeGender   factanal  12888    12.6 Kb          12      NA
dates            data.frame   9016     8.8 Kb         669       2
sd.                 numeric   3808     3.7 Kb          51      NA
napply             function   2256     2.2 Kb          NA      NA
lsos               function   1944     1.9 Kb          NA      NA
load               loadings   1768     1.7 Kb          12       2
ind.sup             integer    448  448 bytes         102      NA
x                 character     96   96 bytes           1      NA
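Since lsos() just forwards to .ls.objects(), the other arguments defined above are available too; for instance (the "^DF" pattern is only an illustration):

    # sort by row count instead of size
    .ls.objects(order.by = "Length/Rows", decreasing = TRUE, head = TRUE, n = 10)

    # restrict the listing to objects whose names match a pattern (passed to ls())
    .ls.objects(pattern = "^DF")

    # widen the shorthand to the 20 largest objects
    lsos(n = 20)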
NOTE: The main part I added was (again, adapted from JD's answer):

obj.prettysize <- napply(names, function(x) {
    format(utils::object.size(x), units = "auto") })
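For reference, format() applied to an object.size() result is what produces the readable strings in the PrettySize column; a quick illustrative check (exact numbers vary slightly by platform):

    x <- rnorm(1e6)
    object.size(x)                           # roughly 8000048 bytes
    format(object.size(x), units = "auto")   # roughly "7.6 Mb"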
I make aggressive use of the subset() function, selecting only the required variables, when passing data frames to the data= argument of regression functions. It does result in some errors if I forget to add a variable to both the formula and the select= vector, but it still saves a lot of time through decreased copying of objects, and it reduces the memory footprint significantly. Say I have 4 million records with 110 variables (and I do). Example:
# library(rms); library(Hmisc) for the cph and rcs functions
Mayo.PrCr.rbc.mdl <-
    cph(formula = Surv(surv.yr, death) ~ age + Sex + nsmkr + rcs(Mayo, 4) +
                  rcs(PrCr.rat, 3) + rbc.cat * Sex,
        data = subset(set1HLI, gdlab2 & HIVfinal == "Negative",
                      select = c("surv.yr", "death", "PrCr.rat", "Mayo",
                                 "age", "Sex", "nsmkr", "rbc.cat")
        ) )
By way of context for this strategy: gdlab2 is a logical vector flagging subjects in the dataset who had normal or near-normal values on a battery of laboratory tests, and HIVfinal is a character vector summarizing preliminary and confirmatory HIV testing.