creating new column based on other grouped variables and multiple conditions
suppose I have following data:
df1<- data.frame(province= c(1,1,2,3,3,3,4,4,4,4,4,5,5,5),year= c(2001,2001,2001,2001,2001,2001,2002,2002,2003,2003,2003,2004,2005,2005),
residence= c(1,1,1,2,2,2,1,1,1,2,2,2,2,2),marriage= c(1,2,2,1,2,1,1,1,2,1,1,1,2,1),count=c(4,1,3,5,3,2,2,3,2,1,2,4,2,5))
in my data marriage = 1
is ever-married
and marriage = 2
is never-married
. the proportion of ever-married
can be estimated by column count
: ever-married / ever-married + never-married
what I want is estimating the proportion of ever-married
based on columns province
, year
and residence
and two conditions:
1- if there is no ever-married based on three columns, the proportion would be 0
2- if there is no never-married based on three columns, the proportion would be 100
.
my expected output would be like this:
province year residence sub
1 2001 1 0.80
2 2001 1 0.00
3 2001 2 0.70
4 2002 1 100.00
4 2003 1 0.00
4 2003 2 100.00
5 2004 2 100.00
5 2005 2 0.71
thank you in advance.
We group by 'province', 'year', 'residence', create a condition based on if/else
when 'marriage' values 1, 2 are not present, then return 0, 100 respectively or else
get the 'count' values that correspond to 'marriage' value of 1, divide by the sum
of 'count' and then sum
the proportions
library(dplyr)
df1 %>%
group_by(province, year, residence) %>%
summarise(sub = if(!any(marriage == 1)) 0
else if(!any(marriage == 2)) 100 else
sum(count[marriage == 1]/sum(count)), .groups = 'drop')
-output
# A tibble: 8 × 4
province year residence sub
<dbl> <dbl> <dbl> <dbl>
1 1 2001 1 0.8
2 2 2001 1 0
3 3 2001 2 0.7
4 4 2002 1 100
5 4 2003 1 0
6 4 2003 2 100
7 5 2004 2 100
8 5 2005 2 0.714