Replace missing values with column mean

I am not sure how to loop over each column to replace the NA values with the column mean. When I am trying to replace for one column using the following, it works well.

Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE))

The code for looping over columns is not working:

for(i in 1:ncol(data)){
    data[i][is.na(data[i])] <- round(mean(data[i], na.rm = TRUE))
}

the values are not replaced. Can someone please help me with this?

A relatively simple modification of your code should solve the issue:

for(i in 1:ncol(data)){
  data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}

If DF is your data frame of numeric columns:

library(zoo)
na.aggregate(DF)

ADDED:

Using only the base of R define a function which does it for one column and then lapply to every column:

NA2mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
replace(DF, TRUE, lapply(DF, NA2mean))

The last line could be replaced with the following if it's OK to overwrite the input:

DF[] <- lapply(DF, NA2mean)

There is also quick solution using the imputeTS package:

library(imputeTS)
na_mean(yourDataFrame)

dplyr's mutate_all or mutate_at could be useful here:

library(dplyr)                                                             

set.seed(10)                                                               
df <- data.frame(a = sample(c(NA, 1:3)    , replace = TRUE, 10),           
                 b = sample(c(NA, 101:103), replace = TRUE, 10),                            
                 c = sample(c(NA, 201:203), replace = TRUE, 10))                            

df         

#>     a   b   c
#> 1   2 102 203
#> 2   1 102 202
#> 3   1  NA 203
#> 4   2 102 201
#> 5  NA 101 201
#> 6  NA 101 202
#> 7   1  NA 203
#> 8   1 101  NA
#> 9   2 101 203
#> 10  1 103 201

df %>% mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))          

#>        a       b        c
#> 1  2.000 102.000 203.0000
#> 2  1.000 102.000 202.0000
#> 3  1.000 101.625 203.0000
#> 4  2.000 102.000 201.0000
#> 5  1.375 101.000 201.0000
#> 6  1.375 101.000 202.0000
#> 7  1.000 101.625 203.0000
#> 8  1.000 101.000 202.1111
#> 9  2.000 101.000 203.0000
#> 10 1.000 103.000 201.0000

df %>% mutate_at(vars(a, b),~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))

#>        a       b   c
#> 1  2.000 102.000 203
#> 2  1.000 102.000 202
#> 3  1.000 101.625 203
#> 4  2.000 102.000 201
#> 5  1.375 101.000 201
#> 6  1.375 101.000 202
#> 7  1.000 101.625 203
#> 8  1.000 101.000  NA
#> 9  2.000 101.000 203
#> 10 1.000 103.000 201

To add to the alternatives, using @akrun's sample data, I would do the following:

d1[] <- lapply(d1, function(x) { 
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
d1

Replace missing values with column mean

Related

Recent Posts