"Set Difference" between two vectors with duplicate values
I have 3 vectors:
x <- c(1,3,5,7,3,8)
y <- c(3,5,7)
z <- c(3,3,8)
I want to find the elements of x that are not in y, and separately the elements of x that are not in z. Is there a function f that would give me the following output:
> f(x,y)
1 3 8
> f(x,z)
1 5 7
In other words, I want to find the "set difference" between 2 vectors, either of which may have repeated values. The functions %in%, match and setdiff do not work in this case because they ignore how many times a value occurs.
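For illustration, here is a quick check of what the standard tools return on the vectors above. This is only a sketch in base R, added to make the problem concrete:

```r
x <- c(1, 3, 5, 7, 3, 8)
y <- c(3, 5, 7)

# setdiff() treats its arguments as sets, so duplicates are discarded
setdiff(x, y)
#[1] 1 8

# %in% (like match) only tests membership, so every 3 in x is dropped
# even though y contains a single 3
x[!x %in% y]
#[1] 1 8
```

Both results are missing the second 3 that the desired output 1 3 8 keeps.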
Solution 1:
There may be better ways to do this, but here is one option:
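get_diff_vectors <- function(x, y) {
  # Count how often each distinct value occurs in x and in y
  count_x <- table(x)
  count_y <- table(y)
  # Position in count_x of each value of y (NA when the value is absent from x)
  same_counts <- match(names(count_y), names(count_x))
  found <- !is.na(same_counts)
  idx <- same_counts[found]
  # Subtract y's counts from x's, never dropping below zero
  count_x[idx] <- pmax(as.vector(count_x[idx]) - as.vector(count_y[found]), 0)
  # Rebuild the vector from the remaining counts (names are character)
  as.numeric(rep(names(count_x), count_x))
}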
get_diff_vectors(x, y)
#[1] 1 3 8
get_diff_vectors(x, z)
#[1] 1 5 7
get_diff_vectors(x, c(5, 7))
#[1] 1 3 3 8
We count the frequency of x and y using table, match the numbers which occur in both, and subtract the counts of y from those of x. Finally, we recreate the remaining vector using rep.
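To make the steps above concrete, here is a sketch that prints the intermediate values for x and y from the question. The names pos and remaining are only illustrative and do not appear in the function above:

```r
x <- c(1, 3, 5, 7, 3, 8)
y <- c(3, 5, 7)

count_x <- table(x)   # counts: 1 -> 1, 3 -> 2, 5 -> 1, 7 -> 1, 8 -> 1
count_y <- table(y)   # counts: 3 -> 1, 5 -> 1, 7 -> 1

# Positions in count_x of the values that occur in y
pos <- match(names(count_y), names(count_x))
pos
#[1] 2 3 4

# Counts left over after removing one copy per occurrence in y
remaining <- as.vector(count_x)
remaining[pos] <- remaining[pos] - as.vector(count_y)
remaining
#[1] 1 1 0 0 1

# rep() repeats each value by its remaining count; zero counts drop out
as.numeric(rep(names(count_x), remaining))
#[1] 1 3 8
```

Note that table sorts the distinct values, so the result comes back in sorted order rather than in the original order of x.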
I still have not found a better way, but here is a dplyr version using similar logic.
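library(dplyr)
get_diff_vectors_dplyr <- function(x, y) {
  # One row per distinct value, with its frequency in column n
  df1 <- data.frame(x) %>% count(x)
  df2 <- data.frame(y) %>% count(y)
  final <- left_join(df1, df2, by = c("x" = "y")) %>%
    # Values of x absent from y get NA for n.y; treat that as a count of zero
    # (coalesce() replaces the deprecated funs() idiom)
    mutate(n.y = coalesce(n.y, 0L),
           n = pmax(n.x - n.y, 0L))
  rep(final$x, final$n)
}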
get_diff_vectors_dplyr(x, y)
#[1] 1 3 8
get_diff_vectors_dplyr(x, z)
#[1] 1 5 7
get_diff_vectors_dplyr(x, c(5, 7))
#[1] 1 3 3 8
The vecsets package mentioned by the OP has a function, vsetdiff, which does this very easily:
vecsets::vsetdiff(x, y)
#[1] 1 3 8
vecsets::vsetdiff(x, z)
#[1] 1 5 7
vecsets::vsetdiff(x, c(5, 7))
#[1] 1 3 3 8
Solution 2:
Here's an attempt using make.unique to account for duplicates:
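dupdiff <- function(x, y) {
  # make.unique() suffixes repeats ("3", "3.1", ...) so each copy matches only once
  hits <- match(
    make.unique(as.character(y)),
    make.unique(as.character(x)),
    nomatch = 0
  )
  # Guard the no-match case: x[-0] would return an empty vector
  if (any(hits > 0)) x[-hits] else x
}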
Testing:
dupdiff(x,y)
#[1] 1 3 8
dupdiff(x,z)
#[1] 1 5 7
dupdiff(x, c(5, 7))
#[1] 1 3 3 8
dupdiff(x, c(5, 7, 9))
#[1] 1 3 3 8
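To see why this works, it may help to look at the labels make.unique produces for x and z from the question. This is only an illustrative sketch of the intermediate values:

```r
x <- c(1, 3, 5, 7, 3, 8)
z <- c(3, 3, 8)

# The second occurrence of a value gets a ".1" suffix, a third would get ".2", ...
make.unique(as.character(x))
#[1] "1"   "3"   "5"   "7"   "3.1" "8"
make.unique(as.character(z))
#[1] "3"   "3.1" "8"

# Each copy in z therefore matches a different copy in x
match(make.unique(as.character(z)), make.unique(as.character(x)), nomatch = 0)
#[1] 2 5 6
```

Because the labels are compared as text, a vector that already contains a value such as 3.1 next to repeated 3s could, as far as I can tell, be matched incorrectly, so this approach seems safest when the values cannot be mistaken for suffixed labels.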