Collapsing rows where some are all NA, others are disjoint with some NAs

Solution 1:

Here's a data table approach that uses na.omit() across the columns, grouped by ID.

library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = ID]
#    ID Col1 Col2 Col3 Col4
# 1:  1    5   10   15   20
# 2:  2   25   30   35   40

Solution 2:

Try

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))

Edit: To account for the case in which one column has all NAs for a certain ID, we need sum_NA() function which returns NA if all are NAs

txt <- "ID    Col1    Col2    Col3    Col4
        1     NA      NA      NA      NA
        1     5       10      NA      NA
        1     NA      NA      15      20
        2     NA      NA      NA      NA
        2     NA      30      NA      NA
        2     NA      NA      35      40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>% 
  group_by(ID) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40

Solution 3:

Here's a couple of aggregate attempts:

aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Since aggregate's formula interface by default uses na.omit on the entire data before doing any grouping, it will delete every row of dat as they all contain at least one NA value. Try it: nrow(na.omit(dat)) returns 0. So in this case, use na.pass in aggregate and then na.omit to skip over the NAs that were passed through.

Alternatively, don't use the formula interface and specify the columns to aggregate manually:

aggregate(dat[-1], dat[1], FUN=na.omit )
aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Solution 4:

Since dplyr 1.0.0, you can also do (using the data provided by @Khashaa):

df %>% 
 group_by(ID) %>%
 summarize(across(everything(), ~ first(na.omit(.))))

     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40

Solution 5:

the simple way is:

as.data.frame(lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)[!is.na(x)]))

but if not all columns have the same number of non-NA values then you'll need to trim them like so:

temp  <-  lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)x[!is.na(x)])
len  <-  min(sapply(temp,length))
as.data.frame(lapply(temp,`[`,seq(len)))