Remove duplicate column pairs, sort rows based on 2 columns [duplicate]
In the following data frame I want to keep rows only once if they have duplicate pairs of Var1 and Var2 (1 4 and 4 1 are considered the same pair). I thought of sorting Var1 and Var2 within each row and then removing duplicate rows based on both Var1 and Var2. However, I don't get to my desired result.
This is what my data looks like:
Var1 <- c(1,2,3,4,5,5)
Var2 <- c(4,3,2,1,5,5)
f <- c("blue","green","yellow","red","orange2","grey")
g <- c("blue","green","yellow","red","orange1","grey")
testdata <- data.frame(Var1,Var2,f,g)
I can sort within the rows, but the values of columns f and g should remain untouched. How do I do this?
library(data.table)
# this sorts every value within each row, so f and g get reordered as well
testdata <- t(apply(testdata, 1, function(x) x[order(x)]))
testdata <- as.data.table(testdata)
Then, I want to remove duplicate rows based on Var1 and Var2.
I want to get this as a result:
Var1 Var2 f g
1 4 blue blue
2 3 green green
5 5 orange2 orange1
Thanks for your help!
Solution 1:
In case people are interested in solving this using dplyr:
library(dplyr)
testdata %>%
  rowwise() %>%
  # build a key from the sorted pair so (1, 4) and (4, 1) match
  mutate(key = paste(sort(c(Var1, Var2)), collapse = "")) %>%
  distinct(key, .keep_all = TRUE) %>%
  select(-key)
# Source: local data frame [3 x 4]
# Groups: <by row>
#
# # A tibble: 3 × 4
# Var1 Var2 f g
# <dbl> <dbl> <fctr> <fctr>
# 1 1 4 blue blue
# 2 2 3 green green
# 3 5 5 orange2 orange1
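If you prefer to avoid rowwise(), a vectorised sketch (not part of the original answer; the helper column names lo and hi are my own) builds the unordered pair with pmin() and pmax() and then calls distinct() on it:
library(dplyr)

testdata %>%
  # pmin()/pmax() return the smaller and larger value of each pair,
  # so (1, 4) and (4, 1) yield the same (lo, hi) combination
  mutate(lo = pmin(Var1, Var2), hi = pmax(Var1, Var2)) %>%
  distinct(lo, hi, .keep_all = TRUE) %>%
  select(-lo, -hi)
This gives the same three rows without per-row grouping, and it also avoids the ambiguity of pasting multi-digit values together with collapse="".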
Solution 2:
Instead of sorting the whole dataset, sort only 'Var1' and 'Var2' within each row, and then use duplicated() to remove the duplicate rows:
# sort Var1 and Var2 within each row so the smaller value always comes first
testdata[1:2] <- t(apply(testdata[1:2], 1, sort))
# keep only the first occurrence of each (Var1, Var2) pair
testdata[!duplicated(testdata[1:2]), ]
# Var1 Var2 f g
#1 1 4 blue blue
#2 2 3 green green
#5 5 5 orange2 orange1
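Since the question already converts to a data.table, here is an equivalent sketch using the data.table package (the temporary column names lo and hi are my own):
library(data.table)

dt <- as.data.table(testdata)
# store the smaller value of each pair in lo and the larger in hi
dt[, c("lo", "hi") := .(pmin(Var1, Var2), pmax(Var1, Var2))]
# keep the first row of each (lo, hi) combination, then drop the helpers
res <- unique(dt, by = c("lo", "hi"))
res[, c("lo", "hi") := NULL]
res
The result contains the same three rows as the desired output above, with Var1, Var2, f and g untouched.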