Delete columns/rows with more than x% missing

To remove columns with more than a given share of NA values, you can apply colMeans() to the logical matrix returned by is.na():

## Some sample data
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)

## Remove columns with 50% or more NA (keep columns that are more than 50% non-NA)
dat[, which(colMeans(!is.na(dat)) > 0.5)]

## Remove rows with 50% or more NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]

## Remove columns and rows with 50% or more NA
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]
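
The base R approach above can be wrapped in a small function so the cut-off is a parameter. This is only a sketch: drop_sparse(), max_na, and the toy data frame are hypothetical names introduced here, not part of the original answer. Note that it keeps rows and columns whose NA share is exactly at the cut-off, which matches "more than x% missing" literally.

```r
# Hypothetical helper: drop rows and/or columns whose share of NA values
# is strictly greater than `max_na` (a fraction between 0 and 1).
drop_sparse <- function(x, max_na = 0.5, rows = TRUE, cols = TRUE) {
  na_mask <- is.na(x)  # logical matrix with the same shape as x
  keep_rows <- if (rows) rowMeans(na_mask) <= max_na else rep(TRUE, nrow(x))
  keep_cols <- if (cols) colMeans(na_mask) <= max_na else rep(TRUE, ncol(x))
  # Both masks are computed on the original data, not sequentially
  x[keep_rows, keep_cols, drop = FALSE]
}

# Small fixed example so the result does not depend on sample()
toy <- data.frame(
  a = c(1, NA, NA, NA),  # 75% NA
  b = c(1, 2, NA, 4),    # 25% NA
  c = c(1, 2, NA, NA)    # 50% NA
)

res <- drop_sparse(toy, max_na = 0.5)
print(res)
```

Because both masks come from the original data, the result can differ from filtering columns first and then rows; which order is appropriate depends on the data set.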

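A tidyverse solution that removes columns with at least x% NAs (50% here):

library(magrittr)  # provides %>%

test_data <- data.frame(A = c(rep(NA, 12), 520, 233, 522),
                        B = c(rep(10, 12), 520, 233, 522))
# Remove all columns with %NA >= 50
# (replace >= with > to keep columns that are exactly 50% NA)

test_data %>%
  purrr::discard(~ sum(is.na(.x)) / length(.x) * 100 >= 50)

     B
1   10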
2   10
3   10
4   10
5   10
6   10
7   10
8   10
9   10
10  10
11  10
12  10
13 520
14 233
15 522

A dplyr solution

For selecting columns based on a logical condition, we can use the selection helper where(), as in:


library(dplyr)

threshold <- 0.5  # for a 50% cut-off

# Keep columns whose share of NA values is below the threshold
df %>% select(where(~ mean(is.na(.x)) < threshold))

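For filtering rows, dplyr's if_any() and if_all() handle the 100% and 0% cut-offs, as in df %>% filter(if_any(everything(), ~ !is.na(.x))), which keeps rows that have at least one non-missing value; swapping in if_all() keeps only rows with no missing values. For other threshold values, you can use rowMeans:


df %>% filter(rowMeans(is.na(.)) < threshold)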
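
Since df is not defined in the answer, here is a minimal self-contained sketch of both dplyr filters. The tibble and its column names x, y, z are made up for illustration, as are the result names cols_kept and rows_kept.

```r
library(dplyr)

# Hypothetical toy data standing in for `df`
df <- tibble(
  x = c(1, NA, 3, NA),   # 50% NA
  y = c(NA, NA, NA, 4),  # 75% NA
  z = c(1, 2, 3, 4)      # 0% NA
)
threshold <- 0.5

# Columns: keep those whose NA share is strictly below the threshold
cols_kept <- df %>% select(where(~ mean(is.na(.x)) < threshold))

# Rows: keep those whose NA share is strictly below the threshold
rows_kept <- df %>% filter(rowMeans(is.na(.)) < threshold)

print(cols_kept)
print(rows_kept)
```

With a strict < comparison, column x (exactly 50% NA) is dropped along with y; use <= if columns or rows sitting exactly on the threshold should be kept.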

Here is another tip for filtering a df whose columns contain more than 50% NA (or NaN) values:

## Remove columns with more than 50% NA
rawdf.prep1 = rawdf[, sapply(rawdf, function(x) sum(is.na(x)) / length(x) * 100 <= 50)]

This results in a df that keeps only the columns in which no more than 50% of the values are NA or NaN.
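
The answers here mention both NA and NaN, and in R the two are not interchangeable when counting. A short illustration with a made-up vector: is.na() flags both, while is.nan() flags only NaN, so the choice of function changes the computed percentage.

```r
x <- c(1, NA, NaN, 4)

na_flags  <- is.na(x)   # TRUE for NA and for NaN
nan_flags <- is.nan(x)  # TRUE for NaN only

share_na  <- mean(na_flags)   # share counted as missing by is.na()
share_nan <- mean(nan_flags)  # share counted as missing by is.nan()

print(share_na)
print(share_nan)
```

If only true NaN values should count toward the threshold, swap is.na() for is.nan() in the snippets above; is.nan() works on atomic vectors, so apply it column by column (for example inside sapply()) rather than to a whole data frame.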