Randomly insert NAs into dataframe proportionally
I have a complete dataframe. I want 20% of the values in the dataframe to be replaced by NAs to simulate random missing data.
A <- c(1:10)
B <- c(11:20)
C <- c(21:30)
df<- data.frame(A,B,C)
Can anyone suggest a quick way of doing that?
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26
as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 NA 25
## 6 6 16 26
## 7 NA 17 27
## 8 8 18 28
## 9 9 19 29
## 10 10 20 30
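Each value is replaced independently with probability 0.15, so the share of NAs will not be exactly 15% on every run. For the 20% asked about in the question, use prob = c(0.8, 0.2).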
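To see how much the share of NAs moves around, here is a rough sketch that repeats the per-column approach many times. The wrapper name mask_columns and its argument p_na are made up for this illustration; they are not part of base R or any package.

```r
# Hypothetical helper wrapping the per-column lapply() approach.
mask_columns <- function(data, p_na) {
  as.data.frame(lapply(data, function(cc) {
    # Indexing with TRUE keeps a value, indexing with NA yields NA.
    cc[sample(c(TRUE, NA), prob = c(1 - p_na, p_na),
              size = length(cc), replace = TRUE)]
  }))
}

df <- data.frame(A = 1:10, B = 11:20, C = 21:30)

# Realized share of NAs in each of 1000 independent runs.
shares <- replicate(1000, mean(is.na(mask_columns(df, 0.15))))
range(shares)  # spread of the realized share across runs
mean(shares)   # close to 0.15 on average
```

With only 30 cells, individual runs can land well away from 15%, which is worth keeping in mind if you need an exact count.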
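You can unlist the data.frame, set a random sample of positions to NA, and then put the values back in a data.frame. Note that you need to sample positions, not values: sample(df, n) draws from the values, which only works here by coincidence because the values 1 to 30 happen to match their positions.
vals <- unlist(df)
# number of cells to blank out, rounded to a whole number
n <- round(length(vals) * 0.15)
# sample positions (indices), not values
vals[sample(length(vals), n)] <- NA
# rebuild with the original number of columns and column names
as.data.frame(matrix(vals, ncol = ncol(df), dimnames = list(NULL, names(df))))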
It can be done a bunch of different ways using sample().
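As one more variation, here is a minimal sketch that sets an exact number of cells to NA without flattening the data.frame, so column names and types are kept. The function name insert_nas and its arguments are hypothetical, chosen only for this example.

```r
# Hypothetical helper: set exactly round(prop * number of cells) cells to NA.
insert_nas <- function(data, prop) {
  n_cells <- nrow(data) * ncol(data)
  n_na <- round(prop * n_cells)
  # Draw distinct cell positions, then convert them to (row, column) pairs.
  idx <- sample(n_cells, n_na)
  rc <- arrayInd(idx, .dim = dim(data))
  # Assign column by column so each column keeps its own type.
  for (j in unique(rc[, 2])) {
    data[rc[rc[, 2] == j, 1], j] <- NA
  }
  data
}

df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
df_na <- insert_nas(df, 0.2)
sum(is.na(df_na))  # 6, i.e. 20% of the 30 cells
```

Because the positions are drawn without replacement, the count of NAs is fixed; only their locations change between runs.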
If you are in the mood to use purrr instead of lapply, you can also do it like this:
> library(purrr)
> df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
> df
A B C
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
5 5 15 25
6 6 16 26
7 7 17 27
8 8 18 28
9 9 19 29
10 10 20 30
> map_df(df, function(x) {x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]})
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 1 11 21
2 2 12 22
3 NA 13 NA
4 4 14 NA
5 5 15 25
6 6 16 26
7 7 17 27
8 8 NA 28
9 9 19 29
10 10 20 30
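The same idea using the binomial distribution, where each cell becomes NA independently with probability 0.2:
dd <- dim(df)    # c(number of rows, number of columns)
nna <- 20 / 100  # overall probability that a cell becomes NA
df1 <- df
# one Bernoulli draw per cell; cells where the draw is 1 are set to NA
df1[matrix(rbinom(prod(dd), size = 1, prob = nna) == 1, nrow = dd[1])] <- NA
df1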