Converting nested list to dataframe
The goal is to convert a nested list that sometimes contains missing records into a data frame. An example of the structure when there are missing records is:
str(mylist)
List of 3
$ :List of 7
..$ Hit : chr "True"
..$ Project: chr "Blue"
..$ Year : chr "2011"
..$ Rating : chr "4"
..$ Launch : chr "26 Jan 2012"
..$ ID : chr "19"
..$ Dept : chr "1, 2, 4"
$ :List of 2
..$ Hit : chr "False"
..$ Error: chr "Record not found"
$ :List of 7
..$ Hit : chr "True"
..$ Project: chr "Green"
..$ Year : chr "2004"
..$ Rating : chr "8"
..$ Launch : chr "29 Feb 2004"
..$ ID : chr "183"
..$ Dept : chr "6, 8"
When there are no missing records, the list can be converted into a data frame using data.frame(do.call(rbind.data.frame, mylist)). However, when records are missing this results in a column mismatch. I know there are functions to merge data frames with non-matching columns, but I have yet to find one that can be applied to lists. The ideal outcome would keep record 2 with NA for all variables. Hoping for some help.
Edit to add dput(mylist):
list(structure(list(Hit = "True", Project = "Blue", Year = "2011",
Rating = "4", Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"), .Names = c("Hit",
"Project", "Year", "Rating", "Launch", "ID", "Dept")), structure(list(
Hit = "False", Error = "Record not found"), .Names = c("Hit",
"Error")), structure(list(Hit = "True", Project = "Green", Year = "2004",
Rating = "8", Launch = "29 Feb 2004", ID = "183", Dept = "6, 8"), .Names = c("Hit",
"Project", "Year", "Rating", "Launch", "ID", "Dept")))
You can also use rbindlist from the data.table package (v1.9.3 or later):
library(data.table)
rbindlist(mylist, fill=TRUE)
##      Hit Project Year Rating      Launch  ID    Dept            Error
## 1:  True    Blue 2011      4 26 Jan 2012  19 1, 2, 4               NA
## 2: False      NA   NA     NA          NA  NA      NA Record not found
## 3:  True   Green 2004      8 29 Feb 2004 183    6, 8               NA
You could create a list of data.frames:
dfs <- lapply(mylist, data.frame, stringsAsFactors = FALSE)
Then use one of these:
library(plyr)
rbind.fill(dfs)
or the faster
library(dplyr)
bind_rows(dfs) # in earlier versions: rbind_all(dfs)
In the case of dplyr::bind_rows, I am surprised that it chooses to use "" instead of NA for missing data. If you remove stringsAsFactors = FALSE, you will get NA, but at the cost of a warning. So suppressWarnings(rbind_all(lapply(mylist, data.frame))) would be an ugly but fast solution.
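If you would rather not depend on data.table, plyr, or dplyr, the same idea can be sketched in base R: collect the union of field names, pad each record with NA for the fields it lacks, then rbind the one-row data frames. This is only a rough sketch, assuming every field holds a single character value as in the question. The names all_cols and pad_record are made up for illustration and are not part of any package.

```r
# Sample data mirroring the dput() output in the question
mylist <- list(
  list(Hit = "True", Project = "Blue", Year = "2011", Rating = "4",
       Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"),
  list(Hit = "False", Error = "Record not found"),
  list(Hit = "True", Project = "Green", Year = "2004", Rating = "8",
       Launch = "29 Feb 2004", ID = "183", Dept = "6, 8")
)

# Union of all field names, in order of first appearance
all_cols <- unique(unlist(lapply(mylist, names)))

# pad_record is a hypothetical helper: it returns a one-row data frame
# holding every column in all_cols, with NA where the record has no value.
# Assumes each field is a length-one character value.
pad_record <- function(rec) {
  row <- setNames(rep(NA_character_, length(all_cols)), all_cols)
  row[names(rec)] <- unlist(rec)
  as.data.frame(as.list(row), stringsAsFactors = FALSE)
}

# Stack the padded rows into a single data frame
result <- do.call(rbind, lapply(mylist, pad_record))
print(result)
```

Record 2 keeps its Hit and Error values and gets NA everywhere else. All columns stay character, so Year, Rating, and ID would still need converting if you want numbers. For large lists this will likely be slower than rbindlist or bind_rows.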