How to speed up combining columns when one column is just a repetition of the same value?

Solution 1:

If the problem as stated is to merge the single repeated value in the first column with the values in the second column, and the second column contains only unique values, then a simple solution is:

data.frame(all_letters_combined=c(df[1,1], df[,2]))

If you need to remove duplicates from the resulting column (duplicates within column 2, or the column 1 value also appearing in column 2), then, based on ekoam's observation that dplyr::distinct() is faster than unique(), here is an option:

distinct(data.frame(all_letters_combined=c(df[1,1], df[,2])))

Of course, if there are more columns or more possible combinations of values, then a more complex solution would be required.
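To make the idea concrete, here is a minimal sketch on toy data. The data frame and its column names (one, other) are made up for illustration and are not from the question; base R's unique() is used so the snippet has no dependencies, and dplyr::distinct() can be swapped in for speed on large inputs.

```r
# Hypothetical toy data: column 1 repeats one value, column 2 varies.
df <- data.frame(
  one   = rep("a", 4),
  other = c("b", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# Take the repeated value once, then append the whole second column.
combined <- data.frame(
  all_letters_combined = c(df[1, 1], df[, 2]),
  stringsAsFactors = FALSE
)

# Drop duplicates, including the case where the column 1 value
# also appears in column 2.
deduped <- unique(combined)
print(deduped)
```

Note that df[1, 1] returns a plain value only for a base data.frame; with a tibble it returns a one-cell tibble, so df[[1]][1] would be the safer spelling there.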

Solution 2:

The bottleneck is unique(), which becomes extremely costly when applied to a list of dataframes. distinct() would be faster. On the other hand, if you already know that the dataframes are unique before pivoting them, giving each of them a unique id to preserve this relationship and deduplicating on that id would be an even better approach. That said, consider the following benchmark.

library(dplyr)
library(tidyr)

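# f1: stack both columns, then deduplicate with base unique() (slow on list-columns)
f1 <- . %>% pivot_longer(everything()) %>% select(value) %>% unique()
# f2: same, but deduplicate with dplyr::distinct()
f2 <- . %>% pivot_longer(everything()) %>% select(value) %>% distinct()
# f3: tag every dataframe with an integer id and deduplicate on the id only
f3 <- . %>% 
  # make both list-columns end in "_df" so they stack into one "df" column
  rename(one_df = one_df, other_df = other_dfs) %>% 
  # the repeated dataframe gets id 0; the others get their row number
  mutate(one_id = 0L, other_id = row_number()) %>% 
  pivot_longer(starts_with(c("one", "other")), names_to = c(NA, ".value"), names_sep = "_") %>% 
  distinct(id, .keep_all = TRUE)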

microbenchmark::microbenchmark(f1(bigger_tib), f2(bigger_tib), f3(bigger_tib), times = 10L)

Output

> f3(bigger_tib)
# A tibble: 11 x 2
   df                          id
   <list>                   <int>
 1 <tibble [1,924,665 x 5]>     0
 2 <tibble [87 x 14]>           1
 3 <df [50 x 2]>                2
 4 <df [32 x 11]>               3
 5 <df [31 x 3]>                4
 6 <df [15 x 2]>                5
 7 <df [30 x 2]>                6
 8 <df [60 x 3]>                7
 9 <ts [468]>                   8
10 <table [4 x 2 x 2 x 2]>      9
11 <df [50 x 4]>               10

Benchmark

Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
 f1(bigger_tib) 619.5852 623.8327 638.0796 634.4866 644.9060 687.6760    10
 f2(bigger_tib) 230.6140 231.6163 234.4957 234.1330 237.1576 238.6012    10
 f3(bigger_tib)   4.0693   5.2220   5.5078   5.2996   5.4089   8.6592    10

One special note on that pivot_longer line: each column name is split at "_". The NA in names_to discards the part before "_", and ".value" makes the part after "_" the name of the output column. Columns that share the same suffix after "_" are therefore stacked into the same output column (here, df and id).
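A small sketch of that behaviour, using character columns in place of the list-column of dataframes. The tibble toy and its values are hypothetical; only the column naming scheme follows the benchmark above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real data: "a" plays the repeated
# dataframe, "b", "c", "d" play the distinct ones.
toy <- tibble(
  one_df   = c("a", "a", "a"),
  other_df = c("b", "c", "d")
) %>%
  mutate(one_id = 0L, other_id = row_number())

# Split each name at "_": drop the prefix (NA) and use the suffix
# as the output column name (".value").
long <- toy %>%
  pivot_longer(
    everything(),
    names_to  = c(NA, ".value"),
    names_sep = "_"
  )

# Deduplicate on the cheap integer id instead of comparing contents.
result <- long %>% distinct(id, .keep_all = TRUE)
print(result)
```

Here long has two columns, df and id, with two rows per input row; deduplicating on id keeps the repeated value once and every other value once, without ever comparing the (potentially large) objects themselves.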
