remove IDs that occur x times R

r rows

I have a data frame df and I would like to remove people who have fewer than X rows in it. E.g., in this toy example, I would like to retain only people who have >= 5 rows.

df
   names  fruit
4   john   kiwi
7   john  apple
9   john banana
13  john orange
14  john  apple
2   mary orange
5   mary  apple
8   mary orange
10  mary  apple
12  mary  apple
1    tom  apple
3    tom banana
6    tom  apple
11   tom   kiwi

Example output:

df
   names  fruit
4   john   kiwi
7   john  apple
9   john banana
13  john orange
14  john  apple
2   mary orange
5   mary  apple
8   mary orange
10  mary  apple
12  mary  apple

Thanks in advance!

Here's a data.table solution using the built-in .N symbol, which the ?data.table help file describes as follows: ‘.N’ is an integer, length 1, containing the number of rows in the group.

# create a similar reproducible example
library(data.table)
dat <- data.table(names = rep(letters[1:3], c(5, 5, 3)), var = 1:13)

Remove the rows:

dat[, cnt:=.N, by=names][cnt >= 5]

I felt there must be a way to do this without assigning a new variable, and there is, thanks to @mnel in the comments:

dat[,if(.N>=5).SD,by=names]

This essentially returns a sub-data.table .SD for each value of the by group if the number of rows in the group .N is greater than or equal to 5. It is pretty much equivalent to the more traditional R subsetting syntax of:

dat[,.SD[.N >= 5],by=names]
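
As a rough check, the three data.table variants above can be run side by side on the same toy data. This is only a sketch: the object names a, b and c3 are made up for illustration, and copy() is used so that the cnt helper column from the first variant is not added to dat by reference. The third variant assumes a data.table version that recycles a length-1 logical in i.

```r
library(data.table)

# Same toy data as in the answer: groups a and b have 5 rows, c has 3
dat <- data.table(names = rep(letters[1:3], c(5, 5, 3)), var = 1:13)

# Variant 1: add a per-group count, filter on it, then keep the original columns.
# copy() keeps `dat` itself free of the helper column `cnt`.
a <- copy(dat)[, cnt := .N, by = names][cnt >= 5, .(names, var)]

# Variant 2: return the group's sub-table only when the group is large enough
b <- dat[, if (.N >= 5) .SD, by = names]

# Variant 3: subset .SD with a single TRUE/FALSE per group
c3 <- dat[, .SD[.N >= 5], by = names]

print(a)
print(b)
print(c3)
```

If the first variant is run directly on dat without copy(), the cnt column stays in dat and will then also show up in .SD for the other two variants.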

You can use table like this:

df[df$names %in% names(table(df$names))[table(df$names) >= 5],]
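
The one-liner above can be unpacked into steps. The sketch below rebuilds the toy data from the question (the original row names are not reproduced) and uses min_rows, counts, keep and result as illustrative names that are not part of the original answer.

```r
# Rebuild the toy data from the question (row names are not reproduced)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), times = c(5, 5, 4)),
  fruit = c("kiwi", "apple", "banana", "orange", "apple",
            "orange", "apple", "orange", "apple", "apple",
            "apple", "banana", "apple", "kiwi"),
  stringsAsFactors = FALSE
)

min_rows <- 5                                # threshold X from the question

counts <- table(df$names)                    # number of rows per name
keep   <- names(counts)[counts >= min_rows]  # names that meet the threshold

# Keep only rows whose name is in the retained set
result <- df[df$names %in% keep, ]
print(result)
```

Computing table() once and storing it avoids tabulating df$names twice, which is the only difference from the one-liner.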
