How can I spread repeated measures of multiple variables into wide format?
Edit: I'm updating this answer since pivot_wider has been around for a while now and addresses the issue in this question and comments. You can now do
pivot_wider(
dat,
id_cols = 'Person',
names_from = 'Time',
values_from = c('Score1', 'Score2', 'Score3'),
names_glue = '{Time}.{.value}'
)
to get the desired result.
The original answer was
dat %>%
gather(temp, score, starts_with("Score")) %>%
unite(temp1, Time, temp, sep = ".") %>%
spread(temp1, score)
Using reshape2
:
library(reshape2)
dcast(melt(dat), Person ~ Time + variable)
Produces:
Using Person, Time as id variables
Person Post_Score1 Post_Score2 Post_Score3 Pre_Score1 Pre_Score2 Pre_Score3
1 greg 79 78 83.5 83 81 87.0
2 sally 82 81 86.5 75 74 79.5
3 sue 78 78 83.0 82 79 85.5
Using dcast
from the data.table
package.
library(data.table)#v1.9.5+
dcast(setDT(dat), Person~Time, value.var=paste0("Score", 1:3))
# Person Score1_Post Score1_Pre Score2_Post Score2_Pre Score3_Post Score3_Pre
#1: greg 79 80 80 78 84.5 84.0
#2: sally 78 75 78 74 83.0 79.5
#3: sue 82 81 81 78 86.5 84.5
Or reshape
from baseR
reshape(as.data.frame(dat), idvar='Person', timevar='Time',direction='wide')
Update
From development version tidyr_0.8.3.9000
or CRAN release tidyr_1.0.0
, we can use pivot_wider
for multiple value columns
library(tidyr)
library(stringr)
dat %>%
pivot_wider(names_from = Time, values_from = str_c("Score", 1:3))
# A tibble: 3 x 7
# Person Score1_Pre Score1_Post Score2_Pre Score2_Post Score3_Pre Score3_Post
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 greg 80 79 78 80 84 84.5
#2 sally 75 78 74 78 79.5 83
#3 sue 81 82 78 81 84.5 86.5
I did a benchmark for myself and post it here in case someone is interested:
Code
The setup is chosen from the OP, three variables, two time points. However, the size of the data frames is varied from 1,000 to 100,000 rows.
library(magrittr)
library(data.table)
library(bench)
f1 <- function(dat) {
tidyr::gather(dat, key = "key", value = "value", -Person, -Time) %>%
tidyr::unite("id", Time, key, sep = ".") %>%
tidyr::spread(id, value)
}
f2 <- function(dat) {
reshape2::dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable)
}
f3 <- function(dat) {
dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable)
}
create_df <- function(rows) {
dat <- expand.grid(Person = factor(1:ceiling(rows/2)),
Time = c("1Pre", "2Post"))
dat$Score1 <- round(rnorm(nrow(dat), mean = 80, sd = 4), 0)
dat$Score2 <- round(jitter(dat$Score1, 15), 0)
dat$Score3 <- 5 + (dat$Score1 + dat$Score2)/2
return(dat)
}
Results
As you can see, reshape2 is a little bit faster than tidyr, probably because tidyr has a larger overhead. Importantly, data.table excels with > 10,000 rows.
press(
rows = 10^(3:5),
{
dat <- create_df(rows)
dat2 <- copy(dat)
setDT(dat2)
bench::mark(tidyr = f1(dat),
reshape2 = f2(dat),
datatable = f3(dat2),
check = function(x, y) all.equal(x, y, check.attributes = FALSE),
min_iterations = 20
)
}
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 9 x 11
#> expression rows min mean median max `itr/sec` mem_alloc
#> <chr> <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 tidyr 1000 5.7ms 6.13ms 6.02ms 10.06ms 163. 2.78MB
#> 2 reshape2 1000 2.82ms 3.09ms 2.97ms 8.67ms 323. 1.7MB
#> 3 datatable 1000 3.82ms 4ms 3.92ms 8.06ms 250. 2.78MB
#> 4 tidyr 10000 19.31ms 20.34ms 19.95ms 22.98ms 49.2 8.24MB
#> 5 reshape2 10000 13.81ms 14.4ms 14.4ms 15.6ms 69.4 11.34MB
#> 6 datatable 10000 14.56ms 15.16ms 14.91ms 18.93ms 66.0 2.98MB
#> 7 tidyr 100000 197.24ms 219.69ms 205.27ms 268.92ms 4.55 90.55MB
#> 8 reshape2 100000 164.02ms 195.32ms 176.31ms 284.77ms 5.12 121.69MB
#> 9 datatable 100000 51.31ms 60.34ms 58.36ms 113.69ms 16.6 27.36MB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>
Created on 2019-02-27 by the reprex package (v0.2.1)