How to compute the mean for certain category rows?
Assuming this data:
library(tibble)
df <- tibble(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i = seq_along(word),
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)
I want to calculate the mean of column i per year for all words in column word except A and banana, and label the result out. A and banana should keep their own names. The output would be:
word i year
<chr> <dbl> <dbl>
1 out 2.5 2000
2 out 3.5 2001
3 banana 3 2000
4 A 6 2000
5 A 7 2001
6 A 8 2002
We may group by 'year' and by 'word', after replacing the elements of 'word' that are not 'A' or 'banana' with 'out', then get the mean of 'i' and, if necessary, order the rows by 'word' (arrange):
library(dplyr)
df %>%
group_by(year, word = replace(word, !(word %in% c('A', 'banana')), 'out')) %>%
summarise(i = mean(i), .groups = 'drop') %>%
arrange(factor(word, levels = c('out', 'banana', 'A'))) %>%
select(names(df))
First we replace every word that is not banana or A with out in an ifelse statement, then group and summarise. Finally, arrange:
library(dplyr)
df %>%
mutate(word = ifelse(word != "banana" & word !="A", "out", word)) %>%
group_by(year, word) %>%
summarise(mean_i = mean(i)) %>%
arrange(desc(word))
year word mean_i
<dbl> <chr> <dbl>
1 2000 out 2.5
2 2001 out 3.5
3 2000 banana 3
4 2000 A 6
5 2001 A 7
6 2002 A 8
Using data.table:
library(data.table)
dt <- as.data.table(df)
# assign name "mean" to the new column with average (mean) values
> dt[!(word %in% c('banana', 'A') ), .(mean = mean(i)), by = .(year, word)]
year word mean
1: 2000 apple 1
2: 2001 apple 2
3: 2000 pear 4
4: 2001 pear 5
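The call above drops 'banana' and 'A' and averages each remaining word separately. To pool the remaining words under a single 'out' label, as the question asks, one possible sketch is shown here. It assumes a data.table version that provides fifelse(); the names keep, res and mean_i are chosen only for illustration.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
dt <- data.table(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i    = 1:8,
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

# Words that keep their own label; everything else is pooled as "out"
keep <- c("A", "banana")

# Group by year and by the relabelled word, then average i within each group
res <- dt[, .(mean_i = mean(i)),
          by = .(year, word = fifelse(word %in% keep, word, "out"))]
res
```

Because the relabelling happens inside by, dt itself is left unchanged.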
Base R solution
We first split the data by word and, within word, by year, then compute the mean of i for each piece. We store the means of i for all words except 'A' and 'apple' in out. Finally, we make a data.frame with columns for word, year, and our means, using some regex to extract tidy word and year values from the names of out.
means <- sapply(split(df, f = ~ df$word + df$year), \(x) mean(x$i))
out <- means[!grepl('^[Aa]', names(means))]
df_new <- data.frame(
word=gsub('\\.[0-9]*', '', names(out)),
year=gsub('[^0-9]', '', names(out)),
mean_i=out
)
rownames(df_new) <- NULL
Output
> df_new
word year mean_i
1 banana 2000 3
2 pear 2000 4
3 banana 2001 NaN
4 pear 2001 5
5 banana 2002 NaN
6 pear 2002 NaN
If you want to omit rows containing missing values of mean_i, you could use something like df_new[rowSums(apply(df_new, 2, is.na)) < 1, ]:
> df_new[rowSums(apply(df_new, 2, is.na)) < 1, ]
word year mean_i
1 banana 2000 3
2 pear 2000 4
4 pear 2001 5
Note: use function(x) instead of \(x) if you use a version of R < 4.1.0.
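For comparison, the 'out' grouping from the question can also be sketched in base R with aggregate(). This is only one possible approach, not part of the answer above; the names dat and res are hypothetical.

```r
# Same data as in the question, as a plain data.frame
dat <- data.frame(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i    = 1:8,
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

# Relabel every word except 'A' and 'banana' as 'out'
dat$word <- ifelse(dat$word %in% c("A", "banana"), dat$word, "out")

# Mean of i for each (word, year) combination that occurs in the data
res <- aggregate(i ~ word + year, data = dat, FUN = mean)
res
```

aggregate() only returns combinations that actually occur, so no NaN rows have to be removed afterwards.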