Dummy variables from a string variable
I would like to create dummy variables from this dataset:
DF<-structure(list(A = c(1, 2, 3, 4, 5), B = c("1,3,2", "2,1,3,6",
"3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6")), .Names = c("A", "B"),
row.names = c(NA, 5L), class = "data.frame")
> DF
A B
1 1 1,3,2
2 2 2,1,3,6
3 3 3,2,5,1,7
4 4 3,7,4,2,6,5
5 5 4,10,7,3,5,6
Desired output should look like this:
A 1 2 3 4 5 6 7 8 9 10
1 1 1 1 0 0 0 0 0 0 0
2 1 1 1 0 0 1 0 0 0 0
3 1 1 1 0 1 0 1 0 0 0
4 0 1 1 1 1 1 1 0 0 0
5 0 0 1 1 1 1 1 0 0 1
Is there an efficient way to do this? I can use strsplit or ifelse. The original dataset is very large, with many rows (>10k) and many values in column B (>15k). The function dummy from the package dummies doesn't work the way I want it to.
I also found a similar case: Splitting one column into multiple columns. But the answers from the link above run really slowly in my case (up to 15 minutes on my Dell i7-2630QM, 8 GB, Win7 64-bit, R 2.15.3 64-bit).
Thank you in advance for your answers.
UPDATE
The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version; the speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.
Install it, and use concat.split.expanded to get the desired result:
library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
# A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
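If you would rather have the bare numeric column names from the desired output than the B_01-style names, a small base R touch-up (my addition, assuming the naming pattern shown above) gets you there:
res <- concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
## Strip the "B_" prefix and any leading zero, so the columns read 1..10
names(res) <- sub("^B_0?", "", names(res))
res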
Original post
A while ago, I had written a function to handle not just this sort of splitting, but others as well. The function, named concat.split(), can be found here.
The usage, for your example data, would be:
## Keeping the original column
concat.split(DF, "B", structure="expanded")
# A B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1,3,2 1 1 1 NA NA NA NA NA NA NA
# 2 2 2,1,3,6 1 1 1 NA NA 1 NA NA NA NA
# 3 3 3,2,5,1,7 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 3,7,4,2,6,5 NA 1 1 1 1 1 1 NA NA NA
# 5 5 4,10,7,3,5,6 NA NA 1 1 1 1 1 NA NA 1
## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 NA NA NA NA NA NA NA
# 2 2 1 1 1 NA NA 1 NA NA NA NA
# 3 3 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 NA 1 1 1 1 1 1 NA NA NA
# 5 5 NA NA 1 1 1 1 1 NA NA 1
Recoding NA to 0 has to be done manually. Perhaps I'll update the function to add an option to do so and, at the same time, implement one of these faster solutions. :)
temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
Update
Most of the overhead in the concat.split function probably comes from things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a (gasp!) for loop, but test it out, and you'll find that it performs pretty well:
b <- strsplit(DF$B, ",")
ncol <- max(as.numeric(unlist(b)))
temp <- lapply(b, as.numeric)
## Set up an empty matrix
m <- matrix(0, nrow = nrow(DF), ncol = ncol)
## Fill it in: row i gets 1s at the positions listed in B
for (i in seq_len(nrow(DF))) {
  m[i, temp[[i]]] <- 1
}
## View your result
m
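If you also want the A column and numeric column names, as in the desired output, a short follow-up (my own sketch, plain base R) would be:
## Bind A back on and name the dummy columns 1..ncol
out <- data.frame(A = DF$A, m)
names(out)[-1] <- seq_len(ncol(m))
out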
Update: Added benchmarks below.
Update 2: Added benchmarks for @Ananda's solution. Wow, it's fast!! Added benchmarks for an even larger data set, and @Ananda's solution speeds ahead by an even larger margin.
Original Answer:
As you can see below, KnownMax and UnknownMax outperform even the data.table solution. I suspect, though, that with 10e6+ rows the data.table solution would be fastest. (Feel free to benchmark it by simply modifying the parameters at the very bottom of this post.)
Solution 1: KnownMax
If you know the maximum value in B, then you have a nice two-liner:
maximum <- 10
results <- t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1 1 1 0 0 0 0 0 0 0
# [2,] 1 1 1 0 0 1 0 0 0 0
# [3,] 1 1 1 0 1 0 1 0 0 0
# [4,] 0 1 1 1 1 1 1 0 0 0
# [5,] 0 0 1 1 1 1 1 0 0 1
Three lines, if you want to name the columns and rows:
dimnames(results) <- list(seq(nrow(results)), seq(ncol(results)))
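The trailing + 0 is simply a compact way to coerce the logical matrix produced by %in% into 0/1. A tiny illustration of the coercion:
c(TRUE, FALSE, TRUE) + 0
# [1] 1 0 1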
Solution 2: UnknownMax
# if you do not know the maximum ahead of time:
splat <- strsplit(DF$B, ",")
maximum <- max(as.numeric(unlist(splat)))
t(sapply(splat, `%in%`, x=1:maximum)) + 0
Solution 3: DT
As per @dickoa's request, here is an option with data.table.
DT <- data.table(DF)
DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]
cols <- DT.long[, max(vals)]
rows <- DT.long[, max(A)]
matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols,
byrow=TRUE, dimnames=list(seq(rows), seq(cols)))
# 1 2 3 4 5 6 7 8 9 10
# 1 1 1 1 0 0 0 0 0 0 0
# 2 1 1 1 0 0 1 0 0 0 0
# 3 1 1 1 0 1 0 1 0 0 0
# 4 0 1 1 1 1 1 1 0 0 0
# 5 0 0 1 1 1 1 1 0 0 1
A similar setup can be done in base R as well.
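For illustration, here is one base R sketch of the same long-format idea (my own addition; it is not one of the benchmarked solutions):
## Reshape to long form, then cross-tabulate A against the values;
## factor(..., levels = ...) keeps all columns 1..max in numeric order
splat <- strsplit(DF$B, ",")
vals <- as.numeric(unlist(splat))
long <- data.frame(A = rep(DF$A, sapply(splat, length)),
                   vals = factor(vals, levels = seq_len(max(vals))))
(table(long$A, long$vals) > 0) + 0  # 0/1 matrix with one row per A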
Here are some benchmarks with slightly larger data:
microbenchmark(KnownMax = eval(KnownMax), UnknownMax = eval(UnknownMax),
DT.withAssign = eval(DT.withAssign),
DT.withOutAssign = eval(DT.withOutAssign),
lapply.Dickoa = eval(lapply.Dickoa), apply.SimonO101 = eval(apply.SimonO101),
forLoop.Ananda = eval(forLoop.Ananda), times=50L)
Using the OP's data.frame, where the result is 5 x 10:
Unit: microseconds
expr min lq median uq max neval
KnownMax 106.556 114.692 122.4915 129.406 6427.521 50
UnknownMax 114.470 122.561 128.9780 136.384 158.346 50
DT.withAssign 3000.777 3099.729 3198.8175 3291.284 10415.315 50
DT.withOutAssign 2637.023 2739.930 2814.0585 2903.904 9376.747 50
lapply.Dickoa 7031.791 7315.781 7438.6835 7634.647 14314.687 50
apply.SimonO101 430.350 465.074 487.9505 522.938 7568.442 50
forLoop.Ananda 81.415 91.027 99.7530 104.588 265.394 50
Using the slightly larger data.frame (below), where the result is 1000 x 100 (lapply.Dickoa removed, as my edit might have slowed it down and, as it stood, it crashed):
Unit: milliseconds
expr min lq median uq max neval
KnownMax 34.83210 35.59068 36.13330 38.15960 52.27746 50
UnknownMax 36.41766 37.17553 38.03075 47.71438 55.57009 50
DT.withAssign 31.95005 32.65798 33.73578 43.71493 50.05831 50
DT.withOutAssign 31.36063 32.08138 32.80728 35.32660 51.00037 50
apply.SimonO101 78.61677 91.72505 95.53592 103.36052 163.14346 50
forLoop.Ananda 13.61827 14.02197 14.18899 14.58777 26.42266 50
An even larger set, where the result is 10,000 x 600:
Unit: milliseconds
expr min lq median uq max neval
KnownMax 1583.5902 1631.6214 1658.6168 1724.9557 1902.3923 50
UnknownMax 1597.1215 1655.9634 1690.7550 1735.5913 1804.2156 50
DT.withAssign 586.4675 641.7206 660.7330 716.0100 1193.4806 50
DT.withOutAssign 587.0492 628.3731 666.3148 717.5575 776.2671 50
apply.SimonO101 1916.6589 1995.2851 2044.9553 2079.6754 2385.1028 50
forLoop.Ananda 163.4549 172.5627 182.6207 211.9153 315.0706 50
Using the following:
library(microbenchmark)
library(data.table)
KnownMax <- quote(t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0)
UnknownMax <- quote({ splat <- strsplit(DF$B, ","); maximum <- max(as.numeric(unlist(splat))); t(sapply(splat, `%in%`, x=1:maximum)) + 0})
DT.withAssign <- quote({DT <- data.table(DF); DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
DT.withOutAssign <- quote({DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
lapply.Dickoa <- quote({ tmp <- strsplit(DF$B, ","); label <- 1:max(as.numeric(unlist(tmp))); tmp <- lapply(tmp, function(x) as.data.frame(lapply(label, function(y) (x == y)))); unname(t(sapply(tmp, colSums))) })
apply.SimonO101 <- quote({cols <- 1:max( as.numeric( unlist(strsplit(DF$B,",")))); t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) ) })
forLoop.Ananda <- quote({b = strsplit(DF$B, ","); ncol = max(as.numeric(unlist(b))); temp = lapply(b, as.numeric); m = matrix(0, nrow = nrow(DF), ncol = ncol) ; for (i in 1:nrow(DF)) { m[i, temp[[i]]] = 1 }; m })
# Slightly modified @Dickoa's algorithm to allow for instances where B is only a single number.
# Instead of using `sapply(.)`, I used `as.data.frame(lapply(.))`, in the hope that the simplification step in `sapply` takes about the same time as `as.data.frame`.
identical(eval(lapply.Dickoa), eval(UnknownMax))
identical(eval(lapply.Dickoa), unname(eval(apply.SimonO101)))
identical(eval(lapply.Dickoa), eval(KnownMax))
identical(unname(as.matrix(eval(DT.withAssign))), eval(KnownMax))
# ALL TRUE
This is what was used to create the sample data:
# larger data created as follows
set.seed(1)
maximum <- 600
rows <- 10000
DF <- data.frame(A=seq(rows), B=as.character(c(maximum, replicate(rows-1, paste(sample(maximum, sample(20), FALSE), collapse=",")))), stringsAsFactors=FALSE)
DT <- data.table(DF)
DT
One way you could do this with ifelse and strsplit (unless I misunderstood and you don't want to use them?) is like this:
cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))))
df <- t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) )
colnames(df) <- cols
df
# 1 2 3 4 5 6 7 8 9 10
#1 1 1 1 0 0 0 0 0 0 0
#2 1 1 1 0 0 1 0 0 0 0
#3 1 1 1 0 1 0 1 0 0 0
#4 0 1 1 1 1 1 1 0 0 0
#5 0 0 1 1 1 1 1 0 0 1
The idea is that we get a vector of the unique values in your desired column, find the max value, and create a vector 1:max(value). Then we apply across each row to find out which of that row's values are in the vector of all values, using ifelse to put a 1 if a value is there and a 0 if it is not. The vector we match against is a sequence, so the output comes out already sorted.
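As a side note, since %in% already returns TRUE/FALSE, the ifelse() could be dropped in favor of the + 0 coercion used in the sapply answers above; an equivalent version with the same logic:
## Logicals coerced to 0/1 via + 0, instead of ifelse()
t(apply(DF["B"], 1, function(x)
  (cols %in% as.numeric(unlist(strsplit(x, ",")))) + 0))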