How to sort a character column by frequency of groups in R?

I asked a question here, but upon review, it wasn't very clear and was asking the wrong questions. Im asking another question now to try and make more clear my issue.

In my data frame I have a column comprised of characters and NAs. Im trying to sort this column into groups based on their frequency, whilst preserving the structure of the groups. For example, if my data looks like this:

library(dplyr)

dfTest <- 
structure(list(var = c("x1", NA, NA, "x4", NA, "x4", NA, NA, 
"x5", NA, NA, "x1", NA, NA, "x5", NA, NA, "x4", NA, "x4", NA, 
NA, "x5", NA, NA, "x2", NA, NA, "x1", NA, "x2", NA, NA, "x5", 
NA, NA), nam = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 
5, 5, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5
), itr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -36L))

> dfTest
# A tibble: 36 × 3
   var     nam   itr
   <chr> <dbl> <dbl>
 1 x1        1     1
 2 NA        1     1
 3 NA        1     1
 4 x4        2     1
 5 NA        2     1
 6 x4        2     1
 7 NA        2     1
 8 NA        2     1
 9 x5        3     1
10 NA        3     1
11 NA        3     1
12 x1        4     1
13 NA        4     1
14 NA        4     1
15 x5        5     1
16 NA        5     1
17 NA        5     1
18 x4        1     2
19 NA        1     2
20 x4        1     2
21 NA        1     2
22 NA        1     2
23 x5        2     2
24 NA        2     2
25 NA        2     2
26 x2        3     2
27 NA        3     2
28 NA        3     2
29 x1        4     2
30 NA        4     2
31 x2        4     2
32 NA        4     2
33 NA        4     2
34 x5        5     2
35 NA        5     2
36 NA        5     2

What I am trying to do is sort the var column by frequency for each itr, whilst preserving the the grouping seen in the nam column. So, for example, when itr = 1, and when we group var and nam, we can see that the structure x1, NA, NA appears twice, as does the structure x5, NA, NA. When itr = 2 the group x5, NA, NA appears twice, and all the rest of the groups appear only once.

My desired output would look something like the data frame below. Where, we can see that each var structure, has been sorted according to their frequency for each itr.

# A tibble: 36 × 3
   var     nam   itr
   <chr> <dbl> <dbl>
 1 x1        1     1
 2 NA        1     1
 3 NA        1     1
 4 x1        4     1
 5 NA        4     1
 6 NA        4     1
 7 x5        3     1
 8 NA        3     1
 9 NA        3     1
10 x5        5     1
11 NA        5     1
12 NA        5     1
13 x4        2     1
14 NA        2     1
15 x4        2     1
16 NA        2     1
17 NA        2     1
18 x5        2     2
19 NA        2     2
20 NA        2     2
21 x5        5     2
22 NA        5     2
23 NA        5     2
24 x4        1     2
25 NA        1     2
26 x4        1     2
27 NA        1     2
28 NA        1     2
29 x2        3     2
30 NA        3     2
31 NA        3     2
32 x1        4     2
33 NA        4     2
34 x2        4     2
35 NA        4     2
36 NA        4     2

I hope that makes sense.

Any suggestions as to how I could achieve this type of sorting whilst preserving the structure of each var by nam?


Solution 1:

Okay, I think this works. I've left in my helper columns for transparency, but you can remove them from the final select.

Basically, we collapse the data to one row per itr:nam group, keeping var around both as a concatenated string that we can group on and as a list column we can unnest to restore the original structures. We then count the frequency for each var_string, and sort the rows by this frequency, using the minimum nam value within the var_string group as a tiebreaker. This is the bulk of the work. We then unnest the data to get back to the original structure in the now-sorted order.

dfTest %>% 
  group_by(itr, nam) %>%
  summarize(var_string = paste(var, collapse = ";"), var= list(var))  %>%
  group_by(itr, var_string) %>%
  add_tally %>%
  mutate(min_nam = min(nam)) %>%
  arrange(itr, desc(n), min_nam, nam) %>%
  ungroup() %>%
  unnest(var) %>%
  select(var, nam, itr, everything()) %>%
  print.data.frame
#     var nam itr     var_string n min_nam
# 1    x1   1   1       x1;NA;NA 2       1
# 2  <NA>   1   1       x1;NA;NA 2       1
# 3  <NA>   1   1       x1;NA;NA 2       1
# 4    x1   4   1       x1;NA;NA 2       1
# 5  <NA>   4   1       x1;NA;NA 2       1
# 6  <NA>   4   1       x1;NA;NA 2       1
# 7    x5   3   1       x5;NA;NA 2       3
# 8  <NA>   3   1       x5;NA;NA 2       3
# 9  <NA>   3   1       x5;NA;NA 2       3
# 10   x5   5   1       x5;NA;NA 2       3
# 11 <NA>   5   1       x5;NA;NA 2       3
# 12 <NA>   5   1       x5;NA;NA 2       3
# 13   x4   2   1 x4;NA;x4;NA;NA 1       2
# 14 <NA>   2   1 x4;NA;x4;NA;NA 1       2
# 15   x4   2   1 x4;NA;x4;NA;NA 1       2
# 16 <NA>   2   1 x4;NA;x4;NA;NA 1       2
# 17 <NA>   2   1 x4;NA;x4;NA;NA 1       2
# 18   x5   2   2       x5;NA;NA 2       2
# 19 <NA>   2   2       x5;NA;NA 2       2
# 20 <NA>   2   2       x5;NA;NA 2       2
# 21   x5   5   2       x5;NA;NA 2       2
# 22 <NA>   5   2       x5;NA;NA 2       2
# 23 <NA>   5   2       x5;NA;NA 2       2
# 24   x4   1   2 x4;NA;x4;NA;NA 1       1
# 25 <NA>   1   2 x4;NA;x4;NA;NA 1       1
# 26   x4   1   2 x4;NA;x4;NA;NA 1       1
# 27 <NA>   1   2 x4;NA;x4;NA;NA 1       1
# 28 <NA>   1   2 x4;NA;x4;NA;NA 1       1
# 29   x2   3   2       x2;NA;NA 1       3
# 30 <NA>   3   2       x2;NA;NA 1       3
# 31 <NA>   3   2       x2;NA;NA 1       3
# 32   x1   4   2 x1;NA;x2;NA;NA 1       4
# 33 <NA>   4   2 x1;NA;x2;NA;NA 1       4
# 34   x2   4   2 x1;NA;x2;NA;NA 1       4
# 35 <NA>   4   2 x1;NA;x2;NA;NA 1       4
# 36 <NA>   4   2 x1;NA;x2;NA;NA 1       4