How to speed up combining columns when one column is just a repetition of the same value?
Solution 1:
If the problem as stated is merging the unique value in first column with the second column. If the first column is just a repeated value and the second column contains all unique values then a simple solution is:
data.frame(all_letters_combined=c(df[1,1], df[,2]))
If you need to remove duplicates (duplicates in column 2 or column 1 is duplicated in column 2) from the resulting column. Based on ekoam's observation that dplyr::distinct()
is faster than unique()
Then here an option:
distinct(data.frame(all_letters_combined=c(df[1,1], df[,2])))
Of course if there are more columns and the different possibilities of values than a more complex solution would be required.
Solution 2:
The bottleneck is unique
, which becomes extremely costly when applied to
a list of dataframes. distinct
would be faster. On the other hand, if you already know that the dataframes are unique before pivoting them, giving each of them a unique id
to preserve this relationship would be an even more ideal approach. That said, consider the following benchmark.
library(dplyr)
library(tidyr)
f1 <- . %>% pivot_longer(everything()) %>% select(value) %>% unique()
f2 <- . %>% pivot_longer(everything()) %>% select(value) %>% distinct()
f3 <- . %>%
rename(one_df = one_df, other_df = other_dfs) %>%
mutate(one_id = 0L, other_id = row_number()) %>%
pivot_longer(starts_with(c("one", "other")), c(NA, ".value"), names_sep = "_") %>%
distinct(id, .keep_all = TRUE)
microbenchmark::microbenchmark(f1(bigger_tib), f2(bigger_tib), f3(bigger_tib), times = 10L)
Output
> f3(bigger_tib)
# A tibble: 11 x 2
df id
<list> <int>
1 <tibble [1,924,665 x 5]> 0
2 <tibble [87 x 14]> 1
3 <df [50 x 2]> 2
4 <df [32 x 11]> 3
5 <df [31 x 3]> 4
6 <df [15 x 2]> 5
7 <df [30 x 2]> 6
8 <df [60 x 3]> 7
9 <ts [468]> 8
10 <table [4 x 2 x 2 x 2]> 9
11 <df [50 x 4]> 10
Benchmark
Unit: milliseconds
expr min lq mean median uq max neval
f1(bigger_tib) 619.5852 623.8327 638.0796 634.4866 644.9060 687.6760 10
f2(bigger_tib) 230.6140 231.6163 234.4957 234.1330 237.1576 238.6012 10
f3(bigger_tib) 4.0693 5.2220 5.5078 5.2996 5.4089 8.6592 10
One special note on that pivot_longer
line: it means that we use the characters after "_"
as names_to
, discard the characters before "_"
. All values stack in the same column if having the same characters after "_"
.