Split a column of concatenated comma-delimited data and recode output as factors
Solution 1:
You just need to write a function and use apply. First, some dummy data:
##Make sure you're not using factors
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                       "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                stringsAsFactors = FALSE)
Next, create a function that takes in a row and transforms it as necessary:
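make_row = function(i, ncol = 5) {
  ## Could make the default NA if needed
  m = numeric(ncol)
  ## Split on commas; as.numeric tolerates the leading spaces
  v = as.numeric(strsplit(i, ",")[[1]])
  ## Switch on the positions named in the row
  m[v] = 1
  return(m)
}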
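As a quick check of what the function returns for a single row, here is a small sketch. The input string is just an illustrative value, and the function is repeated so the snippet runs on its own:

```r
# Same function as above, repeated so this snippet is self-contained
make_row = function(i, ncol = 5) {
  m = numeric(ncol)
  v = as.numeric(strsplit(i, ",")[[1]])
  m[v] = 1
  return(m)
}

# Positions 1, 2 and 4 are switched on; 3 and 5 stay at 0
res = make_row("1, 2, 4")
res
# [1] 1 1 0 1 0
```

Note that ncol has to be at least as large as the biggest value in the data, otherwise the vector silently grows past the intended width.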
Then use apply over the rows and transpose the result (apply returns each row's vector as a column):
t(apply(dd, 1, make_row))
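The matrix returned above has no column names. If you want labelled columns next to the original data, one possible follow-up is sketched below. The "V1_" prefix and the names out and result are my own choices for illustration, not something the steps above require:

```r
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                       "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                stringsAsFactors = FALSE)

make_row = function(i, ncol = 5) {
  m = numeric(ncol)
  v = as.numeric(strsplit(i, ",")[[1]])
  m[v] = 1
  return(m)
}

# One row per input row, one indicator column per possible value
out = t(apply(dd, 1, make_row))

# Hypothetical naming scheme: V1_1, V1_2, ...
colnames(out) = paste0("V1_", seq_len(ncol(out)))

# Keep the original column alongside the indicators
result = cbind(dd, out)
result
```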
Solution 2:
A long time later, I finally got around to creating a package ("splitstackshape") that deals with this kind of data in an efficient manner. So, for the convenience of others (and some self-promotion, of course) here's a compact solution.
The relevant function for this problem is cSplit_e.
First, the default settings, which retain the original column and use NA as the fill:
library(splitstackshape)
cSplit_e(dd, "V1")
#           V1 V1_1 V1_2 V1_3 V1_4 V1_5
# 1    1, 2, 3    1    1    1   NA   NA
# 2    1, 2, 4    1    1   NA    1   NA
# 3 2, 3, 4, 5   NA    1    1    1    1
# 4    1, 3, 4    1   NA    1    1   NA
# 5    1, 3, 5    1   NA    1   NA    1
# 6 2, 3, 4, 5   NA    1    1    1    1
Second, dropping the original column and using 0 as the fill:
cSplit_e(dd, "V1", drop = TRUE, fill = 0)
#   V1_1 V1_2 V1_3 V1_4 V1_5
# 1    1    1    1    0    0
# 2    1    1    0    1    0
# 3    0    1    1    1    1
# 4    1    0    1    1    0
# 5    1    0    1    0    1
# 6    0    1    1    1    1
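The question title also asks for the output recoded as factors, which neither call above does by itself. One way to get there, sketched in base R: the data frame ind below is typed in by hand to match the 0-filled output, so the snippet does not depend on splitstackshape being installed, and the name ind is only for illustration.

```r
# Indicator columns copied from the 0-filled output above
ind = data.frame(V1_1 = c(1, 1, 0, 1, 1, 0),
                 V1_2 = c(1, 1, 1, 0, 0, 1),
                 V1_3 = c(1, 0, 1, 1, 1, 1),
                 V1_4 = c(0, 1, 1, 1, 0, 1),
                 V1_5 = c(0, 0, 1, 0, 1, 1))

# Convert every column to a factor; fixing the levels keeps both
# "0" and "1" even in a column that happens to contain only one of them
ind[] = lapply(ind, factor, levels = c(0, 1))
str(ind)
```

Assigning into ind[] rather than ind keeps the result a data frame instead of a plain list.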