Concatenate several columns to comma separated strings by group
Background: I am in the process of annotating SNPs from a GWAS in an organism without much annotation. I am using the chained tBLASTn table from UCSC along with biomaRt to map each SNP to a probable gene(s).
I have a dataframe that looks like this:
SNP hu_mRNA gene
chr1.111642529 NM_002107 H3F3A
chr1.111642529 NM_005324 H3F3B
chr1.111801684 BC098118 <NA>
chr1.111925084 NM_020435 GJC2
chr1.11801605 AK027740 <NA>
chr1.11801605 NM_032849 C13orf33
chr1.151220354 NM_018913 PCDHGA10
chr1.151220354 NM_018918 PCDHGA5
What I would like to end up with is a single row for each SNP, and comma delimit the genes and hu_mRNAs. Here is what I am after:
SNP hu_mRNA gene
chr1.111642529 NM_002107,NM_005324 H3F3A
chr1.111801684 BC098118,NM_020435 GJC2
chr1.11801605 AK027740,NM_032849 C13orf33
chr1.151220354 NM_018913,NM_018918 PCDHGA10,PCDHGA5
Now I know I can do this with a flick of the wrist in perl, but I really want to do this all in R. Any suggestions?
You could do this in one line using plyr
, as it is a classic split-apply-combine
problem. You split using SNP
, apply paste
with collapse
and assemble the pieces back into a data frame.
plyr::ddply(x, .(SNP), colwise(paste), collapse = ",")
If you want to do data
reshaping in R at the flick of a wrist
, learn plyr
and reshape2
:). Another flick of the wrist solution using data.table
, really useful if you are dealing with massive amounts of data.
data.table::data.table(x)[,lapply(.SD, paste, collapse = ","),'SNP']
You can use aggregate
with paste
for each one and merge
at the end:
x <- structure(list(SNP = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L,
5L), .Label = c("chr1.111642529", "chr1.111801684", "chr1.111925084",
"chr1.11801605", "chr1.151220354"), class = "factor"), hu_mRNA = structure(c(3L,
4L, 2L, 7L, 1L, 8L, 5L, 6L), .Label = c("AK027740", "BC098118",
"NM_002107", "NM_005324", "NM_018913", "NM_018918", "NM_020435",
"NM_032849"), class = "factor"), gene = structure(c(4L, 5L, 1L,
3L, 1L, 2L, 6L, 7L), .Label = c("<NA>", "C13orf33", "GJC2", "H3F3A",
"H3F3B", "PCDHGA10", "PCDHGA5"), class = "factor")), .Names = c("SNP",
"hu_mRNA", "gene"), class = "data.frame", row.names = c(NA, -8L
))
a1 <- aggregate(hu_mRNA~SNP,data=x,paste,sep=",")
a2 <- aggregate(gene~SNP,data=x,paste,sep=",")
merge(a1,a2)
SNP hu_mRNA gene
1 chr1.111642529 NM_002107, NM_005324 H3F3A, H3F3B
2 chr1.111801684 BC098118 <NA>
3 chr1.111925084 NM_020435 GJC2
4 chr1.11801605 AK027740, NM_032849 <NA>, C13orf33
5 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5
First set up the test data. Note that we have made the columns to be of "character"
class rather than "factor"
by using as.is=TRUE
:
Lines <- "SNP hu_mRNA gene
chr1.111642529 NM_002107 H3F3A
chr1.111642529 NM_005324 H3F3B
chr1.111801684 BC098118 <NA>
chr1.111925084 NM_020435 GJC2
chr1.11801605 AK027740 <NA>
chr1.11801605 NM_032849 C13orf33
chr1.151220354 NM_018913 PCDHGA10
chr1.151220354 NM_018918 PCDHGA5"
cat(Lines, "\n", file = "data.txt")
DF <- read.table("data.txt", header = TRUE, na.strings = "<NA>", as.is = TRUE)
Now try this aggregate
statement:
> aggregate(. ~ SNP, DF, toString)
SNP hu_mRNA gene
1 chr1.111642529 NM_002107, NM_005324 H3F3A, H3F3B
2 chr1.111925084 NM_020435 GJC2
3 chr1.11801605 NM_032849 C13orf33
4 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5
This can also be solved using reshape2
's melt
and dcast
operations. With this approach, melt
transforms the data to "long" format first, and then the values are dcast
-ed with the same operation, paste(..., collapse = ",")
:
library(reshape2)
x <- read.table(
stringsAsFactors = FALSE,
header = TRUE,
na.strings = "<NA>",
text = " SNP hu_mRNA gene
chr1.111642529 NM_002107 H3F3A
chr1.111642529 NM_005324 H3F3B
chr1.111801684 BC098118 <NA>
chr1.111925084 NM_020435 GJC2
chr1.11801605 AK027740 <NA>
chr1.11801605 NM_032849 C13orf33
chr1.151220354 NM_018913 PCDHGA10
chr1.151220354 NM_018918 PCDHGA5")
(xm <-melt(x, id.vars = "SNP", na.rm = TRUE))
## SNP variable value
## 1 chr1.111642529 hu_mRNA NM_002107
## 2 chr1.111642529 hu_mRNA NM_005324
## 3 chr1.111801684 hu_mRNA BC098118
## 4 chr1.111925084 hu_mRNA NM_020435
## 5 chr1.11801605 hu_mRNA AK027740
## 6 chr1.11801605 hu_mRNA NM_032849
## 7 chr1.151220354 hu_mRNA NM_018913
## 8 chr1.151220354 hu_mRNA NM_018918
## 9 chr1.111642529 gene H3F3A
## 10 chr1.111642529 gene H3F3B
## 12 chr1.111925084 gene GJC2
## 14 chr1.11801605 gene C13orf33
## 15 chr1.151220354 gene PCDHGA10
## 16 chr1.151220354 gene PCDHGA5
(xc <- dcast(xm, SNP~variable, fun.aggregate = paste, collapse = ","))
## SNP hu_mRNA gene
## 1 chr1.111642529 NM_002107,NM_005324 H3F3A,H3F3B
## 2 chr1.111801684 BC098118
## 3 chr1.111925084 NM_020435 GJC2
## 4 chr1.11801605 AK027740,NM_032849 C13orf33
## 5 chr1.151220354 NM_018913,NM_018918 PCDHGA10,PCDHGA5
Here's a dplyr
solution, which IHMO is the most readable:
library(dplyr)
x %>%
group_by(SNP) %>%
summarize(
genes = paste(gene, collapse = ','),
hu_mRNA = paste(hu_mRNA, collapse = ',')
)
The result:
Source: local data frame [5 x 3]
SNP genes hu_mRNA
(fctr) (chr) (chr)
1 chr1.111642529 H3F3A,H3F3B NM_002107,NM_005324
2 chr1.111801684 <NA> BC098118
3 chr1.111925084 GJC2 NM_020435
4 chr1.11801605 <NA>,C13orf33 AK027740,NM_032849
5 chr1.151220354 PCDHGA10,PCDHGA5 NM_018913,NM_018918