How to delete rows from a dataframe that contain n*NA
I have a number of large datasets with ~10 columns, and ~200000 rows. Not all columns contain values for each row, although at least one column must contain a value for the row to be present, I would like to set a threshold for how many NA
s are allowed in a row.
My Dataframe looks something like this:
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
C NA 9 4 NA 4 8 4 NA 5 NA
D 2 2 6 8 4 NA 3 7 1 32
And I would like to be able to delete the rows that contain more than 2 cells containing NA to get
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
D 2 2 6 8 4 NA 3 7 1 32
complete.cases
removes all rows containing any NA
, and I know one can delete rows that contain NA
in certain columns but is there a way to modify it so that it is non-specific about which columns contain NA
, but how many of the total do?
Alternatively, this dataframe is generated by merging several dataframes using
file1<-read.delim("~/file1.txt")
file2<-read.delim(file=args[1])
file1<-merge(file1,file2,by="chr.pos",all=TRUE)
Perhaps the merge function could be altered?
Thanks
Solution 1:
Use rowSums
. To remove rows from a data frame (df
) that contain precisely n NA
values:
df <- df[rowSums(is.na(df)) != n, ]
or to remove rows that contain n or more NA
values:
df <- df[rowSums(is.na(df)) < n, ]
in both cases of course replacing n
with the number that's required
Solution 2:
If dat
is the name of your data.frame the following will return what you're looking for:
keep <- rowSums(is.na(dat)) < 2
dat <- dat[keep, ]
What this is doing:
is.na(dat)
# returns a matrix of T/F
# note that when adding logicals
# T == 1, and F == 0
rowSums(.)
# quickly computes the total per row
# since your task is to identify the
# rows with a certain number of NA's
rowSums(.) < 2
# for each row, determine if the sum
# (which is the number of NAs) is less
# than 2 or not. Returns T/F accordingly
We use the output of this last statement to identify which rows to keep. Note that it is not necessary to actually store this last logical.
Solution 3:
If d
is your data frame, try this:
d <- d[rowSums(is.na(d)) < 2,]
Solution 4:
This will return a dataset where at most two values per row are missing:
dfrm[ apply(dfrm, 1, function(r) sum(is.na(x)) <= 2 ) , ]