Python's xrange alternative for R OR how to loop over a large dataset lazily?
One (arguably more "proper") way to approach this would be to write your own iterator for the iterators package that @BenBolker suggested (the pdf on writing extensions is here); a rough sketch of that route follows the example below. Lacking something more formal, here is a poor-man's iterator, similar to expand.grid but manually-advancing. (Note: this will suffice given that the computation on each iteration is "more expensive" than this function itself. This could really be improved, but "it works".)
This function returns a named list (with the provided factors) each time the returned function is called. It is lazy in that it does not expand the entire grid of possible combinations; it is not lazy with the arguments themselves, which are evaluated ("consumed") immediately.
lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  # odometer-style counter: the first position starts at 0 so that the
  # first call yields the first combination
  indices <- c(0, rep(1, length(dots) - 1))
  function() {
    indices[1] <<- indices[1] + 1
    # "carry" any position that has rolled past its length into the next one
    while (any(rolls <- (indices > sizes))) {
      if (tail(rolls, n = 1)) return(FALSE)   # last position rolled over: grid exhausted
      indices[rolls] <<- 1
      indices[ 1 + which(rolls) ] <<- indices[ 1 + which(rolls) ] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}
Sample usage:
nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
nxt()
# a b c
# 1 1 15 21
nxt()
# a b c
# 1 2 15 21
nxt()
# a b c
# 1 3 15 21
nxt()
# a b c
# 1 1 16 21
## <yawn>
nxt()
# a b c
# 1 3 16 22
nxt()
# [1] FALSE
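In practice the closure is simply called in a loop until it returns FALSE. A minimal consumption sketch, where doSomethingExpensive is a hypothetical stand-in for the real per-combination work:

nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
repeat {
  combo <- nxt()
  if (identical(combo, FALSE)) break   # grid exhausted
  doSomethingExpensive(combo$a, combo$b, combo$c)
}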
NB: for brevity of display, I used as.data.frame(mapply(...)) to print the examples; it works either way, but if a named list works for you then the conversion to a data.frame isn't necessary.
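For completeness, here is a rough, untested sketch of the iterators-package route mentioned at the top. It follows the convention from that package's custom-iterator vignette (an object of class c("abstractiter", "iter") whose nextElem element is a function, signalling the end with a "StopIteration" error); igrid is a made-up name, and it simply wraps the lazyExpandGrid closure defined above:

library(iterators)

igrid <- function(...) {
  nxt <- lazyExpandGrid(...)          # reuse the closure from above
  nextEl <- function() {
    val <- nxt()
    if (identical(val, FALSE)) stop("StopIteration", call. = FALSE)
    val
  }
  structure(list(nextElem = nextEl), class = c("abstractiter", "iter"))
}

it <- igrid(a=1:3, b=15:16, c=21:22)
nextElem(it)    # list(a = 1, b = 15, c = 21)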
EDIT
Based on alexis_laz's answer, here's a much-improved version that is (a) much faster and (b) allows arbitrary seeking.
lazyExpandGrid <- function(...) {
  dots <- list(...)
  argnames <- names(dots)
  if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
  sizes <- lengths(dots)
  # cumulative products act as the "place values" of a mixed-radix number
  indices <- cumprod(c(1L, sizes))
  maxcount <- indices[ length(indices) ]
  i <- 0
  function(index) {
    # no argument: advance the counter; otherwise seek to the requested index
    i <<- if (missing(index)) (i + 1L) else index
    # a vector of indices: recurse on each one and bind the rows together
    if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
    if (i > maxcount || i < 1L) return(FALSE)
    # decode i into one element index per argument (mixed-radix arithmetic)
    setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L ),
             argnames)
  }
}
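To see how the index arithmetic turns a single counter into one position per argument, here is a worked calculation (the numbers are purely illustrative) for the earlier a/b/c grid, whose sizes are 3, 2, 2:

sizes   <- c(3L, 2L, 2L)
indices <- cumprod(c(1L, sizes))   # 1 3 6 12
i <- 10L
(i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L
# [1] 1 2 2   # i.e. the 10th combination is a=1, b=16, c=22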
It works with no arguments (auto-incrementing the internal counter), a single argument (seeking to that index and setting the internal counter), or a vector argument (seeking to each index, setting the counter to the last one, and returning a data.frame).
This last use-case allows for sampling a subset of the design space:
set.seed(42)
nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
as.data.frame(nxt())
# a b c d e f
# 1 1 1 1 1 1 1
nxt(sample(1e2^6, size=7))
# a b c d e f
# 2 69 61 7 7 49 92
# 21 72 28 55 40 62 29
# 3 88 32 53 46 18 65
# 4 88 33 31 89 66 74
# 5 57 75 31 93 70 66
# 6 100 86 79 42 78 46
# 7 55 41 25 73 47 94
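The vector form also makes it easy to walk the space in fixed-size blocks rather than one combination at a time. A minimal sketch (the chunk size of 1000 is arbitrary, and the loop stops after one block purely for illustration):

nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2)
total <- 1e2^3
chunk <- 1000L
for (start in seq(1L, total, by = chunk)) {
  block <- nxt(start:min(start + chunk - 1L, total))   # a data.frame of combinations
  # ... process `block` here ...
  break   # illustration only
}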
Thanks to alexis_laz for the cumprod, Map, and index-calculation improvements!
Another approach that seems to work:
exp_gr = function(..., index)
{
    args = list(...)
    ns = lengths(args)
    offs = cumprod(c(1L, ns))      # "place values" of each argument
    n = offs[length(offs)]         # total number of combinations
    stopifnot(index <= n)
    # decode 'index' into one (0-based) position per argument
    i = (index[[1L]] - 1L) %% offs[-1L] %/% offs[-length(offs)]
    return(do.call(data.frame,
                   setNames(Map("[[", args, i + 1L),
                            paste("Var", seq_along(args), sep = ""))))
}
In the above function, ... are the arguments to expand.grid and index is the index of the combination to return (i.e. the row number in the order expand.grid would produce).
E.g.:
expand.grid(1:3, 10:12, 21:24, letters[2:5])[c(5, 22, 24, 35, 51, 120, 144), ]
# Var1 Var2 Var3 Var4
#5 2 11 21 b
#22 1 11 23 b
#24 3 11 23 b
#35 2 12 24 b
#51 3 11 22 c
#120 3 10 22 e
#144 3 12 24 e
do.call(rbind, lapply(c(5, 22, 24, 35, 51, 120, 144),
function(i) exp_gr(1:3, 10:12, 21:24, letters[2:5], index = i)))
# Var1 Var2 Var3 Var4
#1 2 11 21 b
#2 1 11 23 b
#3 3 11 23 b
#4 2 12 24 b
#5 3 11 22 c
#6 3 10 22 e
#7 3 12 24 e
And on large structures:
expand.grid(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2)
#Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
# invalid 'times' value
#In addition: Warning message:
#In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
# NAs introduced by coercion to integer range
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1)
# Var1 Var2 Var3 Var4 Var5 Var6
#1 1 1 1 1 1 1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e3 + 487)
# Var1 Var2 Var3 Var4 Var5 Var6
#1 87 15 1 1 1 1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e2 ^ 6)
# Var1 Var2 Var3 Var4 Var5 Var6
#1 100 100 100 100 100 100
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e11 + 154)
# Var1 Var2 Var3 Var4 Var5 Var6
#1 54 2 1 1 1 11
A similar approach would be to construct a "class" that stores the ... arguments to call expand.grid on and define a [ method that calculates the appropriate combination only when it is needed. Using %% and %/% seems valid, though I guess iterating with these operators will be slower than it needs to be.
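A hedged sketch of that idea (the constructor name lazy_grid and the class name "lazygrid" are made up for illustration): it just stores the arguments and lets [ compute the requested rows on demand by reusing exp_gr from above.

lazy_grid <- function(...) structure(list(args = list(...)), class = "lazygrid")

"[.lazygrid" <- function(x, i) {
  # compute each requested combination lazily and bind the rows together
  do.call(rbind, lapply(i, function(k) do.call(exp_gr, c(x$args, index = k))))
}

g <- lazy_grid(1:3, 10:12, 21:24, letters[2:5])
g[c(5, 22, 144)]
#  Var1 Var2 Var3 Var4
#1    2   11   21    b
#2    1   11   23    b
#3    3   12   24    e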