How to compute the mean for certain category rows?

Assuming this data:

library(dplyr)  # also provides tibble()

df <- tibble(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i    = seq_along(word),
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

I want to calculate the mean of column i, per year, for all words in column word except A and banana, and label those pooled rows out. The output would be:

  word       i  year
  <chr>  <dbl> <dbl>
1 out      2.5  2000
2 out      3.5  2001
3 banana   3    2000
4 A        6    2000
5 A        7    2001
6 A        8    2002

We can group by 'year' and by 'word' after replacing every element that is not 'A' or 'banana' with 'out', take the mean of 'i' within each group, and, if necessary, order the rows by 'word' with arrange():

library(dplyr)
df %>% 
  group_by(year, word = replace(word, !(word %in% c('A', 'banana')), 'out')) %>%
  summarise(i = mean(i), .groups = 'drop') %>%
  arrange(factor(word, levels = c('out', 'banana', 'A'))) %>%
  select(names(df))
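
To see what the relabelling step does on its own, here is a minimal sketch that applies the same replace() call to a plain character vector. The names pooled and relabelled are only illustrative and do not appear in the answer above.

```r
# Same words as in the question's data
word <- c("apple", "apple", "banana", "pear", "pear", "A", "A", "A")

# TRUE for every word that should be pooled into "out"
pooled <- !(word %in% c("A", "banana"))

# replace() overwrites only the positions where `pooled` is TRUE
relabelled <- replace(word, pooled, "out")
print(relabelled)
```

Grouping by this relabelled vector together with year is what lets apple and pear share one mean per year.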

First we replace every word other than banana and A with out using ifelse(), then group by year and word and summarise. Finally, we arrange the rows by word in descending order:

library(dplyr)
df %>% 
  mutate(word = ifelse(word != "banana" & word !="A", "out", word)) %>% 
  group_by(year, word) %>% 
  summarise(mean_i = mean(i), .groups = "drop") %>% 
  arrange(desc(word))

   year word   mean_i
  <dbl> <chr>   <dbl>
1  2000 out       2.5
2  2001 out       3.5
3  2000 banana    3  
4  2000 A         6  
5  2001 A         7  
6  2002 A         8

Using data.table (this drops banana and A and averages each remaining word separately per year):

library(data.table)
dt <- as.data.table(df)

# assign name "mean" to the new column with average (mean) values
> dt[!(word %in% c('banana', 'A') ), .(mean = mean(i)), by = .(year, word)]
   year  word mean
1: 2000 apple    1
2: 2001 apple    2
3: 2000  pear    4
4: 2001  pear    5
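
If the goal is instead the pooled result from the question (everything except A and banana averaged together as out), a data.table sketch could relabel inside by. This is one possible variant, not the approach shown above. The names keep, res and mean_i are illustrative, and the data is rebuilt here so that the block runs on its own.

```r
library(data.table)

dt <- data.table(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i    = 1:8,
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

keep <- c("A", "banana")  # words that keep their own name

# Relabel on the fly inside `by`, then average i per (year, word)
res <- dt[, .(mean_i = mean(i)),
          by = .(year, word = fifelse(word %in% keep, word, "out"))]
print(res)
```

fifelse() is data.table's own counterpart to ifelse(). Because the relabelling happens in by, dt itself is left unchanged.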

Base R solution

We first split the data by word and, within each word, by year, and compute the mean of i for every piece. The means for all words except 'A' and 'apple' are then stored in out. Finally, we build a data.frame with columns for word, year and the means, using regular expressions to recover the word and year from the names of out.

means <- sapply(split(df, f = ~ df$word + df$year), \(x) mean(x$i))
out <- means[!grepl('^[Aa]', names(means))]
df_new <- data.frame(
  word=gsub('\\.[0-9]*', '', names(out)),
  year=gsub('[^0-9]', '', names(out)),
  mean_i=out
)
rownames(df_new) <- NULL

Output

> df_new
    word year mean_i
1 banana 2000      3
2   pear 2000      4
3 banana 2001    NaN
4   pear 2001      5
5 banana 2002    NaN
6   pear 2002    NaN

If you want to omit rows where mean_i is missing (the NaN rows come from word/year combinations that do not occur in the data), you could use something like df_new[rowSums(apply(df_new, 2, is.na)) < 1, ]

> df_new[rowSums(apply(df_new, 2, is.na)) < 1, ]
    word year mean_i
1 banana 2000      3
2   pear 2000      4
4   pear 2001      5

Note: use function(x) instead of \(x) if your R version is older than 4.1.0.
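
For the grouping asked for in the question (everything except 'A' and 'banana' pooled as 'out'), a base R sketch with aggregate() might look like the following. It is an alternative to the split()/sapply() code above rather than a restatement of it. The name res is illustrative, and a plain data.frame is built here so that no packages are needed.

```r
df <- data.frame(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i    = 1:8,
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

# Pool every word except "A" and "banana" under the label "out"
df$word <- ifelse(df$word %in% c("A", "banana"), df$word, "out")

# aggregate() only returns word/year combinations that occur in the data
res <- aggregate(i ~ word + year, data = df, FUN = mean)
print(res)
```

Because only observed combinations are returned, there are no NaN rows to filter out afterwards.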
