Extract elements common in all column groups
I have a R dataset x as below:
ID Month
1 1 Jan
2 3 Jan
3 4 Jan
4 6 Jan
5 6 Jan
6 9 Jan
7 2 Feb
8 4 Feb
9 6 Feb
10 8 Feb
11 9 Feb
12 10 Feb
13 1 Mar
14 3 Mar
15 4 Mar
16 6 Mar
17 7 Mar
18 9 Mar
19 2 Apr
20 4 Apr
21 6 Apr
22 7 Apr
23 8 Apr
24 10 Apr
25 1 May
26 2 May
27 4 May
28 6 May
29 7 May
30 8 May
31 2 Jun
32 4 Jun
33 5 Jun
34 6 Jun
35 9 Jun
36 10 Jun
I am trying to figure out a R function/code to identify all IDs that exist atleast once in every month.
In the above case, ID
4
& 6
are present in all months.
Thanks
First, split the df$ID
by Month
and use intersect
to find elements common in each sub-group.
Reduce(intersect, split(df$ID, df$Month))
#[1] 4 6
If you want to subset the corresponding data.frame, do
df[df$ID %in% Reduce(intersect, split(df$ID, df$Month)),]
We can use data.table
. Convert the 'data.frame' to 'data.table' (setDT(df1)
), grouped by 'ID', get the row index (.I
) where the number of unique 'Months' are equal to the number of unique 'Months' in the whole dataset and subset the data based on this
library(data.table)
setDT(df1)[df1[, .I[uniqueN(Month) == uniqueN(df1$Month)], ID]$V1]
# ID Month
# 1: 4 Jan
# 2: 4 Feb
# 3: 4 Mar
# 4: 4 Apr
# 5: 4 May
# 6: 4 Jun
# 7: 6 Jan
# 8: 6 Jan
# 9: 6 Feb
#10: 6 Mar
#11: 6 Apr
#12: 6 May
#13: 6 Jun
To extract the 'ID's
setDT(df1)[, ID[uniqueN(Month) == uniqueN(df1$Month)], ID]$V1
#[1] 4 6
Or with base R
1) Using table
with rowSums
v1 <- rowSums(table(df1) > 0)
names(v1)[v1==max(v1)]
#[1] "4" "6"
This info can be used for subsetting the data
subset(df1, ID %in% names(v1)[v1 == max(v1)])
2) Using tapply
lst <- with(df1, tapply(Month, ID, FUN = unique))
names(which(lengths(lst) == length(unique(df1$Month))))
#[1] "4" "6"
Or using dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n_distinct(Month)== n_distinct(df1$Month)) %>%
.$ID %>%
unique
#[1] 4 6
or if we need to get the rows
df1 %>%
group_by(ID) %>%
filter(n_distinct(Month)== n_distinct(df1$Month))
# A tibble: 13 x 2
# Groups: ID [2]
# ID Month
# <int> <chr>
# 1 4 Jan
# 2 6 Jan
# 3 6 Jan
# 4 4 Feb
# 5 6 Feb
# 6 4 Mar
# 7 6 Mar
# 8 4 Apr
# 9 6 Apr
#10 4 May
#11 6 May
#12 4 Jun
#13 6 Jun
An alternative solution using dplyr and purrr:
tib %>%
dplyr::group_by(Month) %>%
dplyr::group_split(.keep = F) %>%
purrr::reduce(intersect)
# A tibble: 2 x 1
# ID
# <dbl>
# 1 4
# 2 6
returns the desired IDs, where tib
is a tibble containing the input data.