How can I spread repeated measures of multiple variables into wide format?

r tidyr

Edit: I'm updating this answer since pivot_wider has been around for a while now and addresses the issue in this question and comments. You can now do

pivot_wider(
    dat, 
    id_cols = 'Person', 
    names_from = 'Time', 
    values_from = c('Score1', 'Score2', 'Score3'), 
    names_glue = '{Time}.{.value}'
)

to get the desired result.

The original answer was

dat %>% 
  gather(temp, score, starts_with("Score")) %>% 
  unite(temp1, Time, temp, sep = ".") %>% 
  spread(temp1, score)

Using reshape2:

library(reshape2)
dcast(melt(dat), Person ~ Time + variable)

Produces:

Using Person, Time as id variables
  Person Post_Score1 Post_Score2 Post_Score3 Pre_Score1 Pre_Score2 Pre_Score3
1   greg          79          78        83.5         83         81       87.0
2  sally          82          81        86.5         75         74       79.5
3    sue          78          78        83.0         82         79       85.5

Using dcast from the data.table package.

library(data.table)#v1.9.5+
dcast(setDT(dat), Person~Time, value.var=paste0("Score", 1:3))
#     Person Score1_Post Score1_Pre Score2_Post Score2_Pre Score3_Post Score3_Pre
#1:   greg          79         80          80         78        84.5       84.0
#2:  sally          78         75          78         74        83.0       79.5
#3:    sue          82         81          81         78        86.5       84.5

Or reshape from baseR

reshape(as.data.frame(dat), idvar='Person', timevar='Time',direction='wide')

Update

From development version tidyr_0.8.3.9000 or CRAN release tidyr_1.0.0, we can use pivot_wider for multiple value columns

library(tidyr)
library(stringr)
dat %>%
     pivot_wider(names_from = Time, values_from = str_c("Score", 1:3))
# A tibble: 3 x 7
#  Person Score1_Pre Score1_Post Score2_Pre Score2_Post Score3_Pre Score3_Post
#   <chr>       <dbl>       <dbl>      <dbl>       <dbl>      <dbl>       <dbl>
#1 greg           80          79         78          80       84          84.5
#2 sally          75          78         74          78       79.5        83  
#3 sue            81          82         78          81       84.5        86.5

I did a benchmark for myself and post it here in case someone is interested:

Code

The setup is chosen from the OP, three variables, two time points. However, the size of the data frames is varied from 1,000 to 100,000 rows.

library(magrittr)
library(data.table)
library(bench)

f1 <- function(dat) {
    tidyr::gather(dat, key = "key", value = "value", -Person, -Time) %>% 
        tidyr::unite("id", Time, key, sep = ".") %>%
        tidyr::spread(id, value)
}

f2 <- function(dat) {
    reshape2::dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable)
}

f3 <- function(dat) {
    dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable)
}

create_df <- function(rows) {
    dat <- expand.grid(Person = factor(1:ceiling(rows/2)),
                       Time = c("1Pre", "2Post"))
    dat$Score1 <- round(rnorm(nrow(dat), mean = 80, sd = 4), 0)
    dat$Score2 <- round(jitter(dat$Score1, 15), 0)
    dat$Score3 <- 5 + (dat$Score1 + dat$Score2)/2
    return(dat)
}

Results

As you can see, reshape2 is a little bit faster than tidyr, probably because tidyr has a larger overhead. Importantly, data.table excels with > 10,000 rows.

press(
    rows = 10^(3:5),
    {
        dat <- create_df(rows)
        dat2 <- copy(dat)
        setDT(dat2)
        bench::mark(tidyr     = f1(dat),
                    reshape2  = f2(dat),
                    datatable = f3(dat2),
                    check = function(x, y) all.equal(x, y, check.attributes = FALSE),
                    min_iterations = 20
        )
    }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 9 x 11
#>   expression   rows      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>       <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 tidyr        1000    5.7ms   6.13ms   6.02ms  10.06ms    163.      2.78MB
#> 2 reshape2     1000   2.82ms   3.09ms   2.97ms   8.67ms    323.       1.7MB
#> 3 datatable    1000   3.82ms      4ms   3.92ms   8.06ms    250.      2.78MB
#> 4 tidyr       10000  19.31ms  20.34ms  19.95ms  22.98ms     49.2     8.24MB
#> 5 reshape2    10000  13.81ms   14.4ms   14.4ms   15.6ms     69.4    11.34MB
#> 6 datatable   10000  14.56ms  15.16ms  14.91ms  18.93ms     66.0     2.98MB
#> 7 tidyr      100000 197.24ms 219.69ms 205.27ms 268.92ms      4.55   90.55MB
#> 8 reshape2   100000 164.02ms 195.32ms 176.31ms 284.77ms      5.12  121.69MB
#> 9 datatable  100000  51.31ms  60.34ms  58.36ms 113.69ms     16.6    27.36MB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

^{Created on 2019-02-27 by the reprex package (v0.2.1)}

How can I spread repeated measures of multiple variables into wide format?

Update

Code

Results

Related

Recent Posts