Deleting columns from a data.frame where NA is more than 15% of the column length [duplicate]

Solution 1:

First, it's always good to share some sample data. It doesn't need to be your actual data--something made up is fine.

set.seed(1)
x <- rnorm(1000)
x[sample(1000, 150)] <- NA
mydf <- data.frame(matrix(x, ncol = 10))

Second, you can use built-in functions to get what you need. Here, is.na(mydf) does an element-wise check and returns a logical matrix of TRUE and FALSE values. Since TRUE and FALSE are treated as 1 and 0 in arithmetic, colMeans gives the proportion of NA values in each column. That proportion can then be checked against your condition. In this case: which columns have more than 15% NA values?

colMeans(is.na(mydf)) > .15
#    X1    X2    X3    X4    X5    X6    X7    X8    X9   X10 
#  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
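
If it helps to see the intermediate steps, here is a small sketch on a made-up data frame. The names toy, na_mask, and na_share are only for illustration and are not part of the sample data above.

```r
# Tiny hypothetical data frame, small enough to check by eye
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 3, 4),
                  c = c(1, 2, 3, 4))

na_mask  <- is.na(toy)         # logical matrix with the same shape as toy
na_share <- colMeans(na_mask)  # TRUE counts as 1, so this is the NA proportion per column

na_share
#    a    b    c 
# 0.25 0.50 0.00 

na_share > 0.15
#     a     b     c 
#  TRUE  TRUE FALSE 
```

Column a has one NA out of four values (25%) and column b has two (50%), so both exceed the 15% cutoff, while column c has none.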

As we can see, we should drop X1, X2, X6, X8, and X9. Again, taking advantage of logical vectors, here's how:

final <- mydf[, colMeans(is.na(mydf)) <= .15]
dim(final)
# [1] 100   5
names(final)
# [1] "X3"  "X4"  "X5"  "X7"  "X10"
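
If you need this more than once, the same idea can be wrapped in a small function. This is only a sketch: drop_na_cols and its max_na argument are names I made up, not something from base R. It also adds drop = FALSE, which I'd suggest because without it a result with a single remaining column is simplified to a plain vector instead of staying a data.frame.

```r
# Hypothetical helper (name and arguments are illustrative only):
# keep the columns whose share of NA values is at most max_na
drop_na_cols <- function(df, max_na = 0.15) {
  keep <- colMeans(is.na(df)) <= max_na
  df[, keep, drop = FALSE]  # drop = FALSE keeps a data.frame even for one column
}

# Made-up example where only one column survives the filter
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 3, 4),
                  c = 1:4)

res <- drop_na_cols(toy)
names(res)   # only column "c" is kept
class(res)   # still a data.frame
```

With the sample data above, drop_na_cols(mydf) should select the same columns as the one-liner, since it applies the same <= .15 condition.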