Select previous and next N rows with the same value as a certain row

I construct the following panel data with keys id and time:

library(dplyr)

pdata <- tibble(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)
pdata
# A tibble: 50 × 3
      id  time value
   <int> <int> <dbl>
 1     1  2016     1
 2     1  2017     1
 3     1  2018     1
 4     1  2019     0
 5     1  2020     0
 6     2  2016     1
 7     2  2017     1
 8     2  2018     0
 9     2  2019     0
10     2  2020     0
# … with 40 more rows

Suppose a shock happened in 2018. For each id, I want to keep symmetric pairs of rows around the shock row (the k-th row before and the k-th row after it, for k up to N), but only while both rows in the pair have the same value as the shock row.

A few examples illustrate this. For id == 5, the dataset looks like:

pdata %>% filter(id == 5)
# A tibble: 5 × 3
     id  time value
  <int> <int> <dbl>
1     5  2016     1
2     5  2017     0
3     5  2018     0
4     5  2019     0
5     5  2020     1

The value in 2018 for id == 5 is 0. I want to keep the shock row plus one row before and one row after it, because all three observations have the value 0. The 2016 and 2020 rows are dropped because their value is 1:

# A tibble: 3 × 3
     id  time value
  <int> <int> <dbl>
1     5  2017     0
2     5  2018     0
3     5  2019     0

For id == 8, I wish to get:

# A tibble: 5 × 3
     id  time value
  <int> <int> <dbl>
1     8  2016     1
2     8  2017     1
3     8  2018     1
4     8  2019     1
5     8  2020     1

For id == 1, I want an empty result, because the observations in 2017 and 2019 do not have the same value (1 and 0).

The final dataset should be:

# A tibble: 19 × 3
      id  time value
   <int> <int> <dbl>
 1     4  2016     0
 2     4  2017     0
 3     4  2018     0
 4     4  2019     0
 5     4  2020     0
 6     5  2017     0
 7     5  2018     0
 8     5  2019     0
 9     6  2017     1
10     6  2018     1
11     6  2019     1
12     7  2017     1
13     7  2018     1
14     7  2019     1
15     8  2016     1
16     8  2017     1
17     8  2018     1
18     8  2019     1
19     8  2020     1

A solution with data.table:

# load the package & convert data to a data.table
library(data.table)
setDT(pdata)

# define shock-year and number of previous/next rows
shock <- 2018
n <- 2

# filter
pdata[, .SD[value == value[time == shock] &
              between(time, shock - n, shock + n) & 
              value == rev(value)][.N > 1 & all(diff(time) == 1)]
      , by = id]

which gives:

    id time value
 1:  4 2016     0
 2:  4 2017     0
 3:  4 2018     0
 4:  4 2019     0
 5:  4 2020     0
 6:  5 2017     0
 7:  5 2018     0
 8:  5 2019     0
 9:  6 2017     1
10:  6 2018     1
11:  6 2019     1
12:  7 2017     1
13:  7 2018     1
14:  7 2019     1
15:  8 2016     1
16:  8 2017     1
17:  8 2018     1
18:  8 2019     1
19:  8 2020     1
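
To see what each condition in the filter contributes, here is a minimal base-R sketch for a single id. The helper name kept_years is hypothetical and only used for illustration; the values are copied from the question's data. Note that value == rev(value) compares each row with its mirror-image row in the group, so it relies on the shock year being the middle row of every id, which holds for this data.

```r
time  <- 2016:2020
shock <- 2018
n     <- 2

# Hypothetical helper: years the filter would keep for one id's values
kept_years <- function(value) {
  keep <- value == value[time == shock] &        # same value as the shock row
    time >= shock - n & time <= shock + n &      # inside the window, like between()
    value == rev(value)                          # equal to its mirror-image row
  yrs <- time[keep]
  # second step: more than just the shock row, and no gaps between kept years
  if (length(yrs) > 1 && all(diff(yrs) == 1)) yrs else integer(0)
}

kept_years(c(0, 1, 1, 1, 1))  # id 7: 2017 2018 2019
kept_years(c(1, 0, 1, 1, 1))  # id 9: nothing (2016, 2018, 2020 are not consecutive)
```

The second case shows why the all(diff(time) == 1) check is needed: without it, id 9 would keep three non-adjacent years.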

Used data:

pdata <- data.frame(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)

Symmetrical range around focal year & range may differ among 'id'

Within each 'id' (by = id), use rleid to create a grouping variable 'r' based on runs of equal values. Within each 'id' and run (by = .(id, r)), check whether the focal year (e.g. 2018) and at least the years immediately before and after it are present (if(sum(time %in% yr_rng) == 3)). If so, select an equal number of rows before and after the focal year (min(c(shock - .I[1], .I[.N] - shock))), where 'shock' is the row number of the focal year. Note that the number of years selected may therefore vary among 'id'.

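yr = 2018
yr_rng = (yr - 1):(yr + 1)

# run id of consecutive equal values within each id
pdata[ , r := rleid(value), by = id]
# for runs covering yr - 1 to yr + 1, collect a symmetric range of row numbers around yr
pdata[pdata[ , if(sum(time %in% yr_rng) == 3) {
  shock = .I[time == yr]                       # row number of the focal year
  rng = min(c(shock - .I[1], .I[.N] - shock))  # rows available on the shorter side
  (shock - rng):(shock + rng)
}, by = .(id, r)]$V1]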

    id time value r
 1:  4 2016     0 1
 2:  4 2017     0 1
 3:  4 2018     0 1
 4:  4 2019     0 1
 5:  4 2020     0 1
 6:  5 2017     0 2
 7:  5 2018     0 2
 8:  5 2019     0 2
 9:  6 2017     1 2
10:  6 2018     1 2
11:  6 2019     1 2
12:  7 2017     1 2
13:  7 2018     1 2
14:  7 2019     1 2
15:  8 2016     1 1
16:  8 2017     1 1
17:  8 2018     1 1
18:  8 2019     1 1
19:  8 2020     1 1
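
The run-and-range logic above can be traced by hand for one id. This is a minimal base-R sketch, not the data.table code itself: run_id is a hypothetical stand-in that mimics rleid on a single vector (assuming no NA values), positions within the id replace .I, and the check that the neighbouring years are present is left out for brevity.

```r
# Hypothetical base-R stand-in for rleid() on one vector (assumes no NA values)
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

time  <- 2016:2020
value <- c(0, 1, 1, 1, 1)           # id 7 from the question's data
yr    <- 2018

r     <- run_id(value)              # 1 2 2 2 2
pos   <- which(r == r[time == yr])  # positions of the run containing the focal year
shock <- which(time == yr)          # position of the focal year
rng   <- min(shock - pos[1], pos[length(pos)] - shock)
time[(shock - rng):(shock + rng)]   # 2017 2018 2019
```

The run around 2018 extends one year back and two years forward, so the shorter side (one year) determines the symmetric range.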

Allowing asymmetrical range around focal year

Within each 'id' and run (by = .(id, r)), check if both previous and next year from the focal year (e.g. 2018) are present (if(sum(time %in% yr_rng) == 3)). If so, select the entire group (.SD).

pdata[ , r := rleid(value), by = id]
pdata[ , if(sum(time %in% yr_rng) == 3) .SD, by = .(id, r)]

    id r time value
 1:  4 1 2016     0
 2:  4 1 2017     0
 3:  4 1 2018     0
 4:  4 1 2019     0
 5:  4 1 2020     0
 6:  5 2 2017     0
 7:  5 2 2018     0
 8:  5 2 2019     0
 9:  6 2 2017     1
10:  6 2 2018     1
11:  6 2 2019     1
12:  7 2 2017     1
13:  7 2 2018     1
14:  7 2 2019     1
15:  7 2 2020     1
16:  8 1 2016     1
17:  8 1 2017     1
18:  8 1 2018     1
19:  8 1 2019     1
20:  8 1 2020     1

As far as I understand the question, here's a dplyr suggestion:


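library(dplyr)
library(purrr)

MyF <- function(id2, shock, nb_row) {
  # all values for this id
  values <- pdata %>%
    filter(id == id2) %>%
    pull(value)
  if (length(unique(values)) == 1) {
    # every year has the same value: keep the whole id
    pdata %>%
      filter(id == id2)
  } else {
    # otherwise keep the window around the shock only if it holds a single value
    pdata %>%
      filter(id == id2) %>%
      filter(time >= shock - nb_row & time <= shock + nb_row) %>%
      filter(length(unique(value)) == 1)
  }
}

map_df(pdata %>%
         select(id) %>%
         distinct() %>%
         pull(id),
       MyF, shock = 2018, nb_row = 1)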

## Or map_df(1:8,MyF,shock = 2018, nb_row = 1)


# A tibble: 19 x 3
      id  time value
   <int> <int> <dbl>
 1     4  2016     0
 2     4  2017     0
 3     4  2018     0
 4     4  2019     0
 5     4  2020     0
 6     5  2017     0
 7     5  2018     0
 8     5  2019     0
 9     6  2017     1
10     6  2018     1
11     6  2019     1
12     7  2017     1
13     7  2018     1
14     7  2019     1
15     8  2016     1
16     8  2017     1
17     8  2018     1
18     8  2019     1
19     8  2020     1
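
The function above tests the whole window at once, so with nb_row = 2 an id such as 5 (values 1, 0, 0, 0, 1) would be dropped entirely. As a cross-check on the rule stated in the question, here is a minimal base-R sketch that grows the window one pair at a time. The helper name keep_pairs is hypothetical, and the sketch assumes each id has exactly one row per year.

```r
pdata <- data.frame(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(1,1,1,0,0, 1,1,0,0,0, 0,0,1,0,0, 0,0,0,0,0, 1,0,0,0,1,
            0,1,1,1,0, 0,1,1,1,1, 1,1,1,1,1, 1,0,1,1,1, 1,1,0,1,1)
)

# Hypothetical helper: rows of one id to keep, for up to n pairs around the shock
keep_pairs <- function(d, shock, n) {
  v0 <- d$value[d$time == shock]
  if (length(v0) != 1) return(d[0, ])   # no (or ambiguous) shock row
  k <- 0
  while (k < n) {
    # values of the next pair outward: shock - (k + 1) and shock + (k + 1)
    pair <- d$value[d$time %in% c(shock - (k + 1), shock + (k + 1))]
    if (length(pair) != 2 || any(pair != v0)) break
    k <- k + 1
  }
  if (k == 0) return(d[0, ])            # not even the nearest pair matches
  d[d$time >= shock - k & d$time <= shock + k, ]
}

res <- do.call(rbind, lapply(split(pdata, pdata$id), keep_pairs, shock = 2018, n = 2))
nrow(res)        # 19
unique(res$id)   # 4 5 6 7 8
```

With n = 2 this keeps five rows for ids 4 and 8 and three rows for ids 5, 6 and 7, matching the expected result in the question.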