R convert dataframe to list of unique memberships per column for each row

This is what I have:

> miniDF
      site1 site2 site3 site4 site5
Alpha     G     T     A     C     T
Beta      G     T     A     T     T
Delta     G     T     G     C     T
Gamma     G     C     A     T     T
Eps       G     T     A     T     T
Pi        A     T     A     T     T
Omi       G     T     A     C     A

miniDF = structure(list(site1 = c("G", "G", "G", "G", "G", "A", "G"), 
    site2 = c("T", "T", "T", "C", "T", "T", "T"), site3 = c("A", 
    "A", "G", "A", "A", "A", "A"), site4 = c("C", "T", "C", "T", 
    "T", "T", "C"), site5 = c("T", "T", "T", "T", "T", "T", "A"
    )), row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", 
"Pi", "Omi"), class = "data.frame")

I'd like to convert it to a list structure for a Venn diagram or UpSet plot, where each row name maps to the sites (columns) in which that row holds the unique letter:

myList = list('Alpha'=c('site4'), 'Beta'=c(), 'Delta'=c('site3', 'site4'), 'Gamma'=c('site2'), 'Eps'=c(), 'Pi'=c('site1'), 'Omi'=c('site4','site5'))

Alpha has only one unique site (a column with a unique cell), Beta has none, and Delta and Omi each have two unique sites.

Unique in this context means that the cell is different from the other cells in that column. So for site1, A is the unique value (all the other values are G), so Pi includes that site in its array.

For columns where there is more than one cell with a different value, like site4, I take the value of the first row to be the unique value, hence Alpha, Delta, and Omi include site4 in their arrays.

Assume I have a few hundred columns.

How can I do this?


Solution 1:

We create a function to find the "unique" value of a column, apply it to every column, and finally go through each row to see which columns hold their unique value in that row.

I've used just base R. The code could probably be a bit more concise if we switched to purrr functions, or possibly more efficient if we used a matrix instead of a data frame.

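pseudo_unique = function(x) {
  # Count each value and sort ascending, so the rarest value comes first.
  tx = sort(table(x))
  # If the rarest value occurs exactly once, it is the "unique" value;
  # otherwise fall back to the value in the first row.
  if(tx[1] == 1) return(names(tx[1])) else return(x[1])
}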

u_vals = lapply(miniDF, pseudo_unique)
result = lapply(
  row.names(miniDF),
  \(row) names(miniDF)[which(unlist(Map("==", u_vals, miniDF[row, ])))]
)
names(result) = row.names(miniDF)  
result
# $Alpha
# [1] "site4"
# 
# $Beta
# character(0)
# 
# $Delta
# [1] "site3" "site4"
# 
# $Gamma
# [1] "site2"
# 
# $Eps
# character(0)
# 
# $Pi
# [1] "site1"
# 
# $Omi
# [1] "site4" "site5"
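
To make the intermediate step easier to follow, here is a small self-contained sketch that shows what pseudo_unique() picks for each column of the sample data. The name u_vals_check is only for illustration, and sapply() is used instead of lapply() purely so the result prints compactly.

```r
# Same helper as above: the rarest value if it occurs exactly once,
# otherwise the value in the first row.
pseudo_unique = function(x) {
  tx = sort(table(x))
  if (tx[1] == 1) return(names(tx[1])) else return(x[1])
}

# The sample data from the question.
miniDF = data.frame(
  site1 = c("G", "G", "G", "G", "G", "A", "G"),
  site2 = c("T", "T", "T", "C", "T", "T", "T"),
  site3 = c("A", "A", "G", "A", "A", "A", "A"),
  site4 = c("C", "T", "C", "T", "T", "T", "C"),
  site5 = c("T", "T", "T", "T", "T", "T", "A"),
  row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi"),
  stringsAsFactors = FALSE
)

# One "unique" value per column (illustrative variable name).
u_vals_check = sapply(miniDF, pseudo_unique)
print(u_vals_check)
# site1 site2 site3 site4 site5 
#   "A"   "C"   "G"   "C"   "A" 
```

site4 has no value that occurs exactly once, so it falls back to the first row's "C", which is why Alpha, Delta, and Omi all pick up site4.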

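Here's the matrix version for the same result. With a few hundred columns, I'd recommend this version. Note that both the `\(row)` lambda shorthand and the `simplify` argument of apply() require R 4.1 or later.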

miniMat = as.matrix(miniDF)
u_vals = apply(miniMat, 2, pseudo_unique)
result = apply(miniMat, 1, \(row) colnames(miniMat)[row == u_vals], simplify = FALSE)
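
If you want to convince yourself that the two versions agree, a quick sanity check along these lines should work. This is a sketch assuming R 4.1 or later; result_df and result_mat are illustrative names, not part of the code above.

```r
pseudo_unique = function(x) {
  tx = sort(table(x))
  if (tx[1] == 1) return(names(tx[1])) else return(x[1])
}

miniDF = data.frame(
  site1 = c("G", "G", "G", "G", "G", "A", "G"),
  site2 = c("T", "T", "T", "C", "T", "T", "T"),
  site3 = c("A", "A", "G", "A", "A", "A", "A"),
  site4 = c("C", "T", "C", "T", "T", "T", "C"),
  site5 = c("T", "T", "T", "T", "T", "T", "A"),
  row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi"),
  stringsAsFactors = FALSE
)

# Data frame version.
u_vals_df = lapply(miniDF, pseudo_unique)
result_df = lapply(
  row.names(miniDF),
  function(row) names(miniDF)[which(unlist(Map("==", u_vals_df, miniDF[row, ])))]
)
names(result_df) = row.names(miniDF)

# Matrix version.
miniMat = as.matrix(miniDF)
u_vals_mat = apply(miniMat, 2, pseudo_unique)
result_mat = apply(
  miniMat, 1,
  function(row) colnames(miniMat)[row == u_vals_mat],
  simplify = FALSE
)

# Compare the two lists element by element.
print(all.equal(result_df, result_mat))
```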

Solution 2:

Here's a solution in the tidyverse.

Solution

First load the tidyverse and create your dataset miniDF.

library(tidyverse)

# ...
# Code to generate 'miniDF'.
# ...

Then define the custom function are_unique() to properly identify which values in each column you consider "unique".

are_unique <- function(x) {
  # Return an empty logical vector for an empty input...
  if(length(x) < 1) {
    return(logical(0))
  }
  
  # ...and otherwise identify which input values are strictly unique.
  are_unique <- !x %in% x[duplicated(x)]
  
  # If unique values actually exist, return that identification as is...
  if(any(are_unique)) {
    return(are_unique)
  }
  
  # ...and otherwise default to treating the first value as "unique"...
  token_unique <- x[1]
  # ...and identify its every occurrence.
  x == token_unique
}
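
To illustrate the three branches, here is a small base R sketch that calls are_unique() on two columns from the sample data and on an empty vector. The function body is a compact restatement of the one above (with the local variable renamed to strictly_unique), and the variable names for the results are illustrative.

```r
are_unique <- function(x) {
  # Empty input gives an empty logical vector.
  if (length(x) < 1) {
    return(logical(0))
  }
  # Flag the values that occur exactly once.
  strictly_unique <- !x %in% x[duplicated(x)]
  if (any(strictly_unique)) {
    return(strictly_unique)
  }
  # No strictly unique value: flag every occurrence of the first value.
  x == x[1]
}

# site1: "A" occurs exactly once, so only that position is flagged.
site1_flags <- are_unique(c("G", "G", "G", "G", "G", "A", "G"))
print(site1_flags)
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

# site4: no value occurs exactly once, so every "C" (the first value) is flagged.
site4_flags <- are_unique(c("C", "T", "C", "T", "T", "T", "C"))
print(site4_flags)
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE

# Empty input.
empty_flags <- are_unique(character(0))
print(empty_flags)
# logical(0)
```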

Finally, apply this tidy workflow:

miniDF %>%
  # Make the letters (row names) a column of their own.
  rownames_to_column("letter") %>%
  # In every other column, identify which values you consider "unique".
  mutate(across(!letter, are_unique)) %>%
  # Pivot into 'col_name | is_unique' format for easy filtration.
  pivot_longer(!letter, names_to = "col_name", values_to = "is_unique") %>%
  # Split by letter into a list, with the subset of rows for each letter.
  split(.$letter) %>%
  # Convert each subset into the vector of 'col_name's flagged as "unique".
  # lapply() always returns a list; sapply() could simplify it to a vector or matrix.
  lapply(function(x){x$col_name[x$is_unique]})

Result

Given a miniDF like your sample here

miniDF <- structure(
  list(
    site1 = c("G", "G", "G", "G", "G", "A", "G"), 
    site2 = c("T", "T", "T", "C", "T", "T", "T"),
    site3 = c("A", "A", "G", "A", "A", "A", "A"),
    site4 = c("C", "T", "C", "T", "T", "T", "C"),
    site5 = c("T", "T", "T", "T", "T", "T", "A")
  ),
  row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi"),
  class = "data.frame"
)

this solution should produce the following list:

list(
  Alpha = "site4",
  Beta  = character(0),
  Delta = c("site3", "site4"),
  Eps   = character(0),
  Gamma = "site2",
  Omi   = c("site4", "site5"),
  Pi    = "site1"
)
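
Note that split() orders the groups alphabetically by letter, so the list comes back in a different order than the rows of miniDF. If the original row order matters, one option is to index the list by the original row names. A minimal sketch, where result_sorted, original_order, and result_ordered are illustrative names:

```r
# The alphabetically ordered list shown above.
result_sorted <- list(
  Alpha = "site4",
  Beta  = character(0),
  Delta = c("site3", "site4"),
  Eps   = character(0),
  Gamma = "site2",
  Omi   = c("site4", "site5"),
  Pi    = "site1"
)

# Row names of miniDF in their original order.
original_order <- c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi")

# Indexing a named list by a character vector reorders its elements.
result_ordered <- result_sorted[original_order]
print(names(result_ordered))
# [1] "Alpha" "Beta"  "Delta" "Gamma" "Eps"   "Pi"    "Omi"  
```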

Note

The answer here by @GregorThomas should likely supersede my own. While my answer was technically posted first, I deleted that answer to fix an error, and Gregor's functional solution was posted before I finally undeleted mine.

Gregor's is likely more elegant anyway.