What is the most efficient way to cast a list as a data frame?
Very often I want to convert a list to a data frame, where every element of the list is itself a list with the same fields and types. For example, I may have a list:
> my.list
[[1]]
[[1]]$global_stdev_ppb
[1] 24267673
[[1]]$range
[1] 0.03114799
[[1]]$tok
[1] "hello"
[[1]]$global_freq_ppb
[1] 211592.6
[[2]]
[[2]]$global_stdev_ppb
[1] 11561448
[[2]]$range
[1] 0.08870838
[[2]]$tok
[1] "world"
[[2]]$global_freq_ppb
[1] 1002043
I want to convert this list to a data frame in which each named field becomes a column. The natural (to me) thing to do is to use do.call:
> my.matrix<-do.call("rbind", my.list)
> my.matrix
global_stdev_ppb range tok global_freq_ppb
[1,] 24267673 0.03114799 "hello" 211592.6
[2,] 11561448 0.08870838 "world" 1002043
Straightforward enough, but when I attempt to cast this matrix as a data frame, the columns remain list elements, rather than vectors:
> my.df<-as.data.frame(my.matrix, stringsAsFactors=FALSE)
> my.df[,1]
[[1]]
[1] 24267673
[[2]]
[1] 11561448
Currently, to get the data frame cast properly, I am iterating over each column using unlist and as.vector, then recasting the data frame as such:
new.list<-lapply(1:ncol(my.matrix), function(x) as.vector(unlist(my.matrix[,x])))
my.df<-as.data.frame(do.call(cbind, new.list), stringsAsFactors=FALSE)
This, however, seems very inefficient. Is there a better way to do this?
I think you want:
> do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE))
  global_stdev_ppb      range   tok global_freq_ppb
1         24267673 0.03114799 hello        211592.6
2         11561448 0.08870838 world       1002043.0
> str(do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE)))
'data.frame': 2 obs. of 4 variables:
$ global_stdev_ppb: num 24267673 11561448
$ range : num 0.0311 0.0887
$ tok : chr "hello" "world"
$ global_freq_ppb : num 211593 1002043
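For anyone who wants to reproduce this, here is a minimal sketch. The list is rebuilt by hand from the printed output in the question, so the construction is an assumption about how my.list was originally created. It also shows why the matrix route ends up with list columns: rbind() on plain lists that mix numbers and strings returns a matrix whose cells are list elements.

```r
# Rebuild the list from the question (values copied from its printed output).
my.list <- list(
  list(global_stdev_ppb = 24267673, range = 0.03114799,
       tok = "hello", global_freq_ppb = 211592.6),
  list(global_stdev_ppb = 11561448, range = 0.08870838,
       tok = "world", global_freq_ppb = 1002043)
)

# rbind() on plain lists gives a matrix of mode "list", which is why
# as.data.frame() on it keeps list columns.
my.matrix <- do.call(rbind, my.list)
typeof(my.matrix)

# Converting each element to a one-row data frame first keeps atomic columns.
my.df <- do.call(rbind, lapply(my.list, data.frame, stringsAsFactors = FALSE))
sapply(my.df, class)
```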
Another option is:
data.frame(t(sapply(mylist, `[`)))
but this simple manipulation results in a data frame of lists:
> str(data.frame(t(sapply(mylist, `[`))))
'data.frame': 2 obs. of 3 variables:
$ a:List of 2
..$ : num 1
..$ : num 2
$ b:List of 2
..$ : num 2
..$ : num 3
$ c:List of 2
..$ : chr "a"
..$ : chr "b"
An alternative along the same lines, which gives the same result as the other solutions (a data frame of vectors), is:
data.frame(lapply(data.frame(t(sapply(mylist, `[`))), unlist))
[Edit: included timings of @Martin Morgan's two solutions, which have the edge over the other solutions that return a data frame of vectors.] Some representative timings on a very simple problem:
mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))
> ## @Joshua Ulrich's solution:
> system.time(replicate(1000, do.call(rbind, lapply(mylist, data.frame,
+ stringsAsFactors=FALSE))))
user system elapsed
1.740 0.001 1.750
> ## @JD Long's solution:
> system.time(replicate(1000, do.call(rbind, lapply(mylist, data.frame))))
user system elapsed
2.308 0.002 2.339
> ## my sapply solution No.1:
> system.time(replicate(1000, data.frame(t(sapply(mylist, `[`)))))
user system elapsed
0.296 0.000 0.301
> ## my sapply solution No.2:
> system.time(replicate(1000, data.frame(lapply(data.frame(t(sapply(mylist, `[`))),
+ unlist))))
user system elapsed
1.067 0.001 1.091
> ## @Martin Morgan's Map() sapply() solution:
> f = function(x) function(i) sapply(x, `[[`, i)
> system.time(replicate(1000, as.data.frame(Map(f(mylist), names(mylist[[1]])))))
user system elapsed
0.775 0.000 0.778
> ## @Martin Morgan's Map() lapply() unlist() solution:
> f = function(x) function(i) unlist(lapply(x, `[[`, i), use.names=FALSE)
> system.time(replicate(1000, as.data.frame(Map(f(mylist), names(mylist[[1]])))))
user system elapsed
0.653 0.000 0.658
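Since the timings only make sense if the approaches agree, here is a small sketch that compares the results of the solutions that return a data frame of vectors. The variable names rbind_df, map_df and sapply_df are mine, introduced only for illustration, and stringsAsFactors = FALSE is passed everywhere so the comparison does not depend on the R version.

```r
mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

# rbind of one-row data frames
rbind_df <- do.call(rbind, lapply(mylist, data.frame, stringsAsFactors = FALSE))

# Map() + lapply() + unlist(): build one vector per field
f <- function(x) function(i) unlist(lapply(x, `[[`, i), use.names = FALSE)
map_df <- as.data.frame(Map(f(mylist), names(mylist[[1]])),
                        stringsAsFactors = FALSE)

# sapply solution No.2: unlist each list column
sapply_df <- data.frame(lapply(data.frame(t(sapply(mylist, `[`))), unlist),
                        stringsAsFactors = FALSE)

# Compare the three results with each other
all.equal(rbind_df, map_df)
all.equal(rbind_df, sapply_df)
```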
I can't tell you this is the "most efficient" in terms of memory or speed, but it's pretty efficient in terms of coding:
my.df <- do.call("rbind", lapply(my.list, data.frame))
The lapply() step with data.frame() turns each list item into a single-row data frame, which then plays nicely with rbind().
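One caveat: before R 4.0.0, data.frame() defaulted to stringsAsFactors = TRUE, so on older versions a string field such as tok would come back as a factor unless the argument is passed explicitly. A minimal sketch, using a cut-down two-field version of the question's list (the shortened list and the names as_factor and as_char are only for illustration):

```r
my.list <- list(
  list(tok = "hello", range = 0.03114799),
  list(tok = "world", range = 0.08870838)
)

# Passing the argument explicitly makes the result independent of the R version.
as_factor <- do.call(rbind, lapply(my.list, data.frame, stringsAsFactors = TRUE))
as_char   <- do.call(rbind, lapply(my.list, data.frame, stringsAsFactors = FALSE))

class(as_factor$tok)  # "factor"
class(as_char$tok)    # "character"
```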
Although this question has long since been answered, it's worth pointing out that the data.table package has rbindlist, which accomplishes this task very quickly:
library(microbenchmark)
library(data.table)
# f is the helper defined for @Martin Morgan's Map() solutions in the timings above
l <- replicate(1E4, list(a=runif(1), b=runif(1), c=runif(1)), simplify=FALSE)
microbenchmark( times=5,
R=as.data.frame(Map(f(l), names(l[[1]]))),
dt=data.frame(rbindlist(l))
)
gives me
Unit: milliseconds
 expr       min        lq    median        uq       max neval
    R 31.060119 31.403943 32.278537 32.370004 33.932700     5
   dt  2.271059  2.273157  2.600976  2.635001  2.729421     5
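Note that rbindlist returns a data.table, which is why the benchmark wraps it in data.frame(). A small sketch on the toy list from the earlier timings, assuming data.table is installed (the requireNamespace() guard is only there so the snippet does nothing when it isn't):

```r
l <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::rbindlist(l)   # a data.table, one row per list element
  df <- as.data.frame(dt)          # convert to a plain data frame
  str(df)
}
```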