Create new variable using year first treated
I have a dataset where groups undergo treatments at different times, and I need to log the year in which the groups first become treated, else giving the value of 0 for all non-treated groups.
df = pd.DataFrame([['CA',2014,0],['CA',2015,0],['CA',2016,1],['CA',2017,1],
['WA',2011,0],['WA',2012,1],['WA',2013,1],['TX',2010,0]],
columns=['Group_ID','Year','Treated'])
The dataframe should look like this once complete:
Group_ID | Year | Treated | First_Treated |
---|---|---|---|
CA | 2014 | 0 | 0 |
CA | 2015 | 0 | 0 |
CA | 2016 | 1 | 2016 |
CA | 2017 | 1 | 2016 |
WA | 2011 | 0 | 0 |
WA | 2012 | 1 | 2012 |
WA | 2013 | 1 | 2012 |
TX | 2010 | 0 | 0 |
The Python code below returns every subsequent year
value rather than the first year
of treatment.
df['first_treated'] = np.where(df['Treated']==1, df['Year'], 0)
I have tried agg()
and min()
functions but neither work properly.
df['first_treated'] = np.where(df['Treated']==1,df['Year'].min,0)
I have also used the R code in Create a group variable first.treat indicating the first year when each unit becomes treated, but using an empty first_treated
column, no data is inserted into the column with the mutate()
function. I receive no errors using that R script on the similar pandas dataframe.
Solution 1:
We can create the group key with cumsum
, then transform
the first
value assign it back
s = df['Treated'].eq(0)
df['new'] = df[~s].groupby(df['Treated'].eq(0).cumsum())['Year'].transform('first')
df.new.fillna(0,inplace=True)
#df.new = df.new.astype(int)
df
Group_ID Year Treated new
0 CA 2014 0 0.0
1 CA 2015 0 0.0
2 CA 2016 1 2016.0
3 CA 2017 1 2016.0
4 WA 2011 0 0.0
5 WA 2012 1 2012.0
6 WA 2013 1 2012.0
7 TX 2010 0 0.0
Solution 2:
For an R version, we can use the tidyverse package to group_by(group_id)
and then use mutate()
to create the first_treated
column, similar to how you've attempted the problem. We can use the lubridate package to easily handle dates.
df = data.frame(
group_id = c(rep("CA", 4), rep("WA", 3), "TX"),
year = c("2014", "2015", "2016", "2017", "2011", "2012", "2013", "2010"),
treated = c(0,0,1,1,0,1,1,0))
df %>%
group_by(group_id) %>%
mutate(year = lubridate::as_date(year, format = '%Y'),
first_treated = ifelse(treated == 1, min(lubridate::year(year)), 0))