Create new variable using year first treated

I have a dataset where groups undergo treatments at different times, and I need to log the year in which the groups first become treated, else giving the value of 0 for all non-treated groups.

 df = pd.DataFrame([['CA',2014,0],['CA',2015,0],['CA',2016,1],['CA',2017,1], 
 ['WA',2011,0],['WA',2012,1],['WA',2013,1],['TX',2010,0]],
 columns=['Group_ID','Year','Treated'])

The dataframe should look like this once complete:

Group_ID Year Treated First_Treated
CA 2014 0 0
CA 2015 0 0
CA 2016 1 2016
CA 2017 1 2016
WA 2011 0 0
WA 2012 1 2012
WA 2013 1 2012
TX 2010 0 0

The Python code below returns every subsequent year value rather than the first year of treatment.

df['first_treated'] = np.where(df['Treated']==1, df['Year'], 0)

I have tried agg() and min() functions but neither work properly.

df['first_treated'] = np.where(df['Treated']==1,df['Year'].min,0)

I have also used the R code in Create a group variable first.treat indicating the first year when each unit becomes treated, but using an empty first_treated column, no data is inserted into the column with the mutate() function. I receive no errors using that R script on the similar pandas dataframe.


Solution 1:

We can create the group key with cumsum , then transform the first value assign it back

s = df['Treated'].eq(0)
df['new'] = df[~s].groupby(df['Treated'].eq(0).cumsum())['Year'].transform('first')
df.new.fillna(0,inplace=True)
#df.new = df.new.astype(int)
df
  Group_ID  Year  Treated     new
0       CA  2014        0     0.0
1       CA  2015        0     0.0
2       CA  2016        1  2016.0
3       CA  2017        1  2016.0
4       WA  2011        0     0.0
5       WA  2012        1  2012.0
6       WA  2013        1  2012.0
7       TX  2010        0     0.0

Solution 2:

For an R version, we can use the tidyverse package to group_by(group_id) and then use mutate() to create the first_treated column, similar to how you've attempted the problem. We can use the lubridate package to easily handle dates.

df = data.frame(
group_id = c(rep("CA", 4), rep("WA", 3), "TX"), 
year = c("2014", "2015", "2016", "2017", "2011", "2012", "2013", "2010"), 
treated = c(0,0,1,1,0,1,1,0))

df %>% 
group_by(group_id) %>%
mutate(year = lubridate::as_date(year, format = '%Y'),
first_treated = ifelse(treated == 1, min(lubridate::year(year)), 0))