Deleting columns from a data.frame where NA is more than 15% of the column length [duplicate]
Solution 1:
First, it's always good to share some sample data. It doesn't need to be your actual data--something made up is fine.
set.seed(1)
x <- rnorm(1000)
x[sample(1000, 150)] <- NA
mydf <- data.frame(matrix(x, ncol = 10))
Second, you can just use built-in functions to get what you need. Here, is.na(mydf) does a logical check and returns a logical matrix of TRUE and FALSE values. Since TRUE and FALSE equate to 1 and 0, we can use colMeans to get the proportion of TRUE (that is, NA) values in each column. That, in turn, can be checked against your threshold: in this case, which columns have more than 15% NA values?
colMeans(is.na(mydf)) > .15
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
As we can see, we should drop X1, X2, X6, X8, and X9. Again, taking advantage of logical vectors, here's how:
final <- mydf[, colMeans(is.na(mydf)) <= .15]
dim(final)
# [1] 100 5
names(final)
# [1] "X3" "X4" "X5" "X7" "X10"
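If you need this more than once, the same idea can be wrapped in a small function. The following is only a sketch: the name drop_na_columns, the max_na argument, and the toy data frame are made up for illustration and are not part of the answer above. It adds drop = FALSE so that the result stays a data.frame even when only one column survives the filter.

```r
# Hypothetical helper (not from the original answer): keep only the columns
# whose share of NA values is at or below `max_na`.
drop_na_columns <- function(df, max_na = 0.15) {
  stopifnot(is.data.frame(df), max_na >= 0, max_na <= 1)
  na_share <- colMeans(is.na(df))  # proportion of NA values per column
  # drop = FALSE keeps a data.frame even if a single column remains
  df[, na_share <= max_na, drop = FALSE]
}

# Small made-up data set to try it on
toy <- data.frame(
  a = c(1, NA, 3, 4),   # 25% NA
  b = c(1, 2, 3, 4),    # 0% NA
  c = c(NA, NA, 3, 4)   # 50% NA
)

kept <- drop_na_columns(toy, max_na = 0.25)
names(kept)
# [1] "a" "b"
```

Note that the comparison is <=, so a column sitting exactly on the threshold (like a here) is kept, which matches the "more than 15%" wording of the question.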