Simulate data and randomly add missing values to a dataframe
How can I randomly add missing values to some or all columns of a simulated dataframe (say, roughly 5% missing in each)? Also, is there a more efficient way to simulate a dataframe with both continuous and factor columns?
# Simulate some data
library(dplyr)  # provides %>% and mutate(), used below
N <- 2000
data <- data.frame(id = 1:N, age = rnorm(N, 18:90), bmi = rnorm(N, 15:40),
                   chol = rnorm(N, 50:350), insulin = rnorm(N, 2:40), sbp = rnorm(N, 50:200),
                   dbp = rnorm(N, 30:150), sex = rep(c(1, 2), each = N / 2),
                   smoke = rep(c(1, 2), N / 2),
                   educ = sample(LETTERS[1:4], N, replace = TRUE))  # one random draw per row
# Manually add some missing values
data <- data %>%
  mutate(age = "is.na<-"(age, age < 19 | age > 88),
         bmi = "is.na<-"(bmi, bmi > 38 | bmi < 16),
         insulin = "is.na<-"(insulin, insulin > 38),
         educ = "is.na<-"(educ, bmi > 35))
Solution 1:
The best solution, in my opinion, is the mice package. It is an R package dedicated to imputation, and it also has a function called ampute for introducing missing data into a data.frame.
ampute - Generate Missing Data For Simulation Purposes. This function generates multivariate missing data in a MCAR, MAR or MNAR manner.
The advantage of this solution is you can set multiple parameters for the simulation of your missing data.
ampute(data, prop = 0.5, patterns = NULL, freq = NULL, mech = "MAR",
       weights = NULL, cont = TRUE, type = NULL, odds = NULL,
       bycases = TRUE, run = TRUE)
As you can see, you can set the proportion of missing values (prop), the missing data mechanism (mech; "MCAR" would be your choice for missing completely at random) and several other parameters. This solution is also quite clean, since it takes only one line of code.
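As a rough illustration of the call above, here is a minimal sketch on a small numeric data frame. The data, the object names (num, res, amp) and the chosen prop are made up for this example. It assumes the defaults described in the mice documentation: with bycases = TRUE, prop is the proportion of incomplete rows, and the default patterns leave one variable missing per incomplete row, so the per-column rate should come out near prop divided by the number of columns.

```r
library(mice)

set.seed(1)
N <- 200

# Hypothetical numeric-only data; ampute() expects numeric columns,
# so id and factor columns are left out and can be re-attached afterwards
num <- data.frame(age  = rnorm(N, 50, 10),
                  bmi  = rnorm(N, 27, 4),
                  chol = rnorm(N, 200, 40))

# About 15% of rows become incomplete, completely at random (MCAR)
res <- ampute(num, prop = 0.15, mech = "MCAR")

amp <- res$amp          # the data with NAs inserted
colMeans(is.na(amp))    # observed share of NAs in each column
```

If the target is a share of missing cells rather than of incomplete rows, bycases = FALSE is the argument to look at.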
Solution 2:
Here's a tidyverse approach that will remove roughly 20% of your data for each column you specify:
set.seed(1)
# example data
N <- 20
data <- data.frame(id = 1:N,
                   age = rnorm(N, 18:90),
                   bmi = rnorm(N, 15:40),
                   chol = rnorm(N, 50:350))
library(tidyverse)
# specify which variables should have missing data and prc of missing data
c_names = c("age","bmi")
prc_missing = 0.20
data %>%
  gather(var, value, -id) %>%   # reshape to long format
  # simulate a random number from 0 to 1 for each row
  mutate(r = runif(nrow(.)),
         # set to NA if it's one of the variables you specified and the
         # random number is at or below your threshold
         value = ifelse(var %in% c_names & r <= prc_missing, NA, value)) %>%
  select(-r) %>%                # remove random number
  spread(var, value)            # reshape back to original format
# id age bmi chol
# 1 1 17.37355 15.91898 49.83548
# 2 2 19.18364 16.78214 50.74664
# 3 3 19.16437 17.07456 52.69696
# 4 4 NA 16.01065 53.55666
# 5 5 22.32951 19.61983 53.31124
# 6 6 22.17953 19.94387 54.29250
# 7 7 24.48743 NA 56.36458
# 8 8 25.73832 20.52925 57.76853
# 9 9 26.57578 NA 57.88765
# 10 10 26.69461 24.41794 59.88111
# 11 11 29.51178 26.35868 60.39811
# 12 12 NA 25.89721 60.38797
# 13 13 NA 27.38767 62.34112
# 14 14 28.78530 27.94619 61.87064
# 15 15 33.12493 27.62294 65.43302
# 16 16 32.95507 NA 66.98040
# 17 17 33.98381 30.60571 65.63278
# 18 18 35.94384 NA 65.95587
# 19 19 36.82122 34.10003 68.56972
# 20 20 37.59390 34.76318 68.86495
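The same "roughly 20%" idea can also be sketched without reshaping, which matters when the columns have mixed types: gathering numeric and character columns into a single value column coerces them to one type. The version below is only a sketch using dplyr's across(); the name out is introduced for this example.

```r
library(dplyr)

set.seed(1)
N <- 20
data <- data.frame(id = 1:N,
                   age = rnorm(N, 18:90),
                   bmi = rnorm(N, 15:40),
                   chol = rnorm(N, 50:350))

c_names <- c("age", "bmi")
prc_missing <- 0.20

# For each chosen column, draw one uniform number per row and set the
# entries whose draw is at or below the threshold to NA
out <- data %>%
  mutate(across(all_of(c_names),
                ~ replace(.x, runif(length(.x)) <= prc_missing, NA)))

colMeans(is.na(out))    # observed share of NAs in each column
```

Because each value is dropped independently, the realized share per column varies around 20%, especially with only 20 rows.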
And this is an alternative that will remove exactly 20% of data for the columns you specify:
set.seed(1)
# example data
N <- 20
data <- data.frame(id = 1:N,
                   age = rnorm(N, 18:90),
                   bmi = rnorm(N, 15:40),
                   chol = rnorm(N, 50:350))
library(tidyverse)
# specify which variables should have missing data and prc of missing data
c_names = c("age","bmi")
prc_missing = 0.20
n_remove = prc_missing * nrow(data)
data %>%
  gather(var, value, -id) %>%   # reshape to long format
  sample_frac(1) %>%            # shuffle rows
  group_by(var) %>%             # for each variable
  # set the first n_remove rows of each variable you specified to NA
  mutate(value = ifelse(var %in% c_names & row_number() <= n_remove, NA, value)) %>%
  spread(var, value)            # reshape back to original format
# # A tibble: 20 x 4
# id age bmi chol
# <int> <dbl> <dbl> <dbl>
# 1 1 17.4 15.9 49.8
# 2 2 19.2 16.8 50.7
# 3 3 19.2 17.1 52.7
# 4 4 NA 16.0 53.6
# 5 5 22.3 NA 53.3
# 6 6 22.2 19.9 54.3
# 7 7 24.5 20.8 56.4
# 8 8 25.7 NA 57.8
# 9 9 26.6 NA 57.9
# 10 10 NA NA 59.9
# 11 11 NA 26.4 60.4
# 12 12 NA 25.9 60.4
# 13 13 29.4 27.4 62.3
# 14 14 28.8 27.9 61.9
# 15 15 33.1 27.6 65.4
# 16 16 33.0 29.6 67.0
# 17 17 34.0 30.6 65.6
# 18 18 35.9 31.9 66.0
# 19 19 36.8 34.1 68.6
# 20 20 37.6 34.8 68.9
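For data that mixes continuous and factor columns, the exact-count idea can be sketched in base R by sampling row positions per column, which leaves each column's type untouched. The data, the educ factor and the name out are made up for this illustration, and round() is added only so that n_remove is a whole number for other values of N.

```r
set.seed(1)
N <- 20
# Hypothetical example data with one factor column
data <- data.frame(id = 1:N,
                   age = rnorm(N, 50, 10),
                   educ = factor(sample(LETTERS[1:4], N, replace = TRUE)),
                   chol = rnorm(N, 200, 40))

c_names <- c("age", "educ")
prc_missing <- 0.20
n_remove <- round(prc_missing * nrow(data))

out <- data
# Pick n_remove row positions independently for each chosen column
out[c_names] <- lapply(out[c_names], function(x) {
  x[sample(length(x), n_remove)] <- NA
  x
})

colSums(is.na(out))     # NA count per column
```

Here age and educ each end up with exactly 4 missing values, while id and chol stay complete.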