Concatenate several columns to comma separated strings by group

Background: I am in the process of annotating SNPs from a GWAS in an organism without much annotation. I am using the chained tBLASTn table from UCSC along with biomaRt to map each SNP to a probable gene(s).

I have a dataframe that looks like this:

            SNP   hu_mRNA     gene
 chr1.111642529 NM_002107    H3F3A
 chr1.111642529 NM_005324    H3F3B
 chr1.111801684 BC098118     <NA>
 chr1.111925084 NM_020435    GJC2
  chr1.11801605 AK027740     <NA>
  chr1.11801605 NM_032849    C13orf33
 chr1.151220354 NM_018913    PCDHGA10
 chr1.151220354 NM_018918    PCDHGA5

What I would like to end up with is a single row for each SNP, and comma delimit the genes and hu_mRNAs. Here is what I am after:

            SNP            hu_mRNA    gene
 chr1.111642529 NM_002107,NM_005324   H3F3A
 chr1.111801684  BC098118,NM_020435   GJC2
  chr1.11801605  AK027740,NM_032849   C13orf33
 chr1.151220354 NM_018913,NM_018918   PCDHGA10,PCDHGA5

Now I know I can do this with a flick of the wrist in perl, but I really want to do this all in R. Any suggestions?


You could do this in one line using plyr, as it is a classic split-apply-combine problem. You split using SNP, apply paste with collapse and assemble the pieces back into a data frame.

plyr::ddply(x, .(SNP), colwise(paste), collapse = ",")

If you want to do data reshaping in R at the flick of a wrist, learn plyr and reshape2 :). Another flick of the wrist solution using data.table, really useful if you are dealing with massive amounts of data.

data.table::data.table(x)[,lapply(.SD, paste, collapse = ","),'SNP']

You can use aggregate with paste for each one and merge at the end:

x <- structure(list(SNP = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L, 
5L), .Label = c("chr1.111642529", "chr1.111801684", "chr1.111925084", 
"chr1.11801605", "chr1.151220354"), class = "factor"), hu_mRNA = structure(c(3L, 
4L, 2L, 7L, 1L, 8L, 5L, 6L), .Label = c("AK027740", "BC098118", 
"NM_002107", "NM_005324", "NM_018913", "NM_018918", "NM_020435", 
"NM_032849"), class = "factor"), gene = structure(c(4L, 5L, 1L, 
3L, 1L, 2L, 6L, 7L), .Label = c("<NA>", "C13orf33", "GJC2", "H3F3A", 
"H3F3B", "PCDHGA10", "PCDHGA5"), class = "factor")), .Names = c("SNP", 
"hu_mRNA", "gene"), class = "data.frame", row.names = c(NA, -8L
))

a1 <- aggregate(hu_mRNA~SNP,data=x,paste,sep=",")
a2 <- aggregate(gene~SNP,data=x,paste,sep=",")
merge(a1,a2)
             SNP              hu_mRNA              gene
1 chr1.111642529 NM_002107, NM_005324      H3F3A, H3F3B
2 chr1.111801684             BC098118              <NA>
3 chr1.111925084            NM_020435              GJC2
4  chr1.11801605  AK027740, NM_032849    <NA>, C13orf33
5 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5

First set up the test data. Note that we have made the columns to be of "character" class rather than "factor" by using as.is=TRUE :

Lines <- "SNP   hu_mRNA     gene
 chr1.111642529 NM_002107    H3F3A
 chr1.111642529 NM_005324    H3F3B
 chr1.111801684 BC098118     <NA>
 chr1.111925084 NM_020435    GJC2
  chr1.11801605 AK027740     <NA>
  chr1.11801605 NM_032849    C13orf33
 chr1.151220354 NM_018913    PCDHGA10
 chr1.151220354 NM_018918    PCDHGA5"
cat(Lines, "\n", file = "data.txt")
DF <- read.table("data.txt", header = TRUE, na.strings = "<NA>", as.is = TRUE)

Now try this aggregate statement:

> aggregate(. ~ SNP, DF, toString)
             SNP              hu_mRNA              gene
1 chr1.111642529 NM_002107, NM_005324      H3F3A, H3F3B
2 chr1.111925084            NM_020435              GJC2
3  chr1.11801605            NM_032849          C13orf33
4 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5

This can also be solved using reshape2's melt and dcast operations. With this approach, melt transforms the data to "long" format first, and then the values are dcast-ed with the same operation, paste(..., collapse = ","):

library(reshape2)

x <- read.table(
  stringsAsFactors = FALSE,
  header = TRUE,
  na.strings = "<NA>",
  text = "            SNP   hu_mRNA     gene
 chr1.111642529 NM_002107    H3F3A
 chr1.111642529 NM_005324    H3F3B
 chr1.111801684 BC098118     <NA>
 chr1.111925084 NM_020435    GJC2
  chr1.11801605 AK027740     <NA>
  chr1.11801605 NM_032849    C13orf33
 chr1.151220354 NM_018913    PCDHGA10
 chr1.151220354 NM_018918    PCDHGA5")

(xm <-melt(x, id.vars = "SNP", na.rm = TRUE))
##               SNP variable     value
## 1  chr1.111642529  hu_mRNA NM_002107
## 2  chr1.111642529  hu_mRNA NM_005324
## 3  chr1.111801684  hu_mRNA  BC098118
## 4  chr1.111925084  hu_mRNA NM_020435
## 5   chr1.11801605  hu_mRNA  AK027740
## 6   chr1.11801605  hu_mRNA NM_032849
## 7  chr1.151220354  hu_mRNA NM_018913
## 8  chr1.151220354  hu_mRNA NM_018918
## 9  chr1.111642529     gene     H3F3A
## 10 chr1.111642529     gene     H3F3B
## 12 chr1.111925084     gene      GJC2
## 14  chr1.11801605     gene  C13orf33
## 15 chr1.151220354     gene  PCDHGA10
## 16 chr1.151220354     gene   PCDHGA5

(xc <- dcast(xm, SNP~variable, fun.aggregate = paste, collapse = ","))
##              SNP             hu_mRNA             gene
## 1 chr1.111642529 NM_002107,NM_005324      H3F3A,H3F3B
## 2 chr1.111801684            BC098118                 
## 3 chr1.111925084           NM_020435             GJC2
## 4  chr1.11801605  AK027740,NM_032849         C13orf33
## 5 chr1.151220354 NM_018913,NM_018918 PCDHGA10,PCDHGA5

Here's a dplyr solution, which IHMO is the most readable:

library(dplyr)

x %>%
  group_by(SNP) %>%
  summarize(
    genes = paste(gene, collapse = ','),
    hu_mRNA = paste(hu_mRNA, collapse = ',')
  )

The result:

Source: local data frame [5 x 3]

             SNP            genes             hu_mRNA
          (fctr)            (chr)               (chr)
1 chr1.111642529      H3F3A,H3F3B NM_002107,NM_005324
2 chr1.111801684             <NA>            BC098118
3 chr1.111925084             GJC2           NM_020435
4  chr1.11801605    <NA>,C13orf33  AK027740,NM_032849
5 chr1.151220354 PCDHGA10,PCDHGA5 NM_018913,NM_018918