New variable for cumulative count of one variable in another variable
I have previously asked a question about how to create a variable which counts instances of one ID number appearing in another column (see here). However, I have now realised that I need the new columns to hold a cumulative sum of the number of times the victim (in a particular crime incident) has been recorded as a suspect (in incidents occurring previous to that incident), and one which counts the number of times the suspect (in a particular crime incident) has been recorded as a victim (in incidents occurring previous to that incident). My data is already ordered by date so all I need is a way of cumulatively counting.
Here's a simplified version of my data:
s.uid | v.uid | |
---|---|---|
1 | 1 | 3 |
2 | 2 | 9 |
3 | 3 | 8 |
4 | 4 | 5 |
5 | 5 | 2 |
6 | 9 | 2 |
7 | NA | 7 |
8 | 5 | 9 |
9 | 9 | 5 |
And here is what I want to create:
s.uid | v.uid | s.in.v | v.in.s | |
---|---|---|---|---|
1 | 1 | 3 | 0 | 0 |
2 | 2 | 9 | 0 | 0 |
3 | 3 | 8 | 1 | 0 |
4 | 4 | 5 | 0 | 0 |
5 | 5 | 2 | 1 | 1 |
6 | 9 | 2 | 1 | 1 |
7 | NA | 7 | NA | 0 |
8 | 5 | 9 | 1 | 1 |
9 | 9 | 5 | 2 | 2 |
Note that, where there is an NA, I would like the NA to be preserved. I'm currently trying to work in tidyverse and piping where possible, so I would prefer answers in that kind of format, but I'm open to any solution!
I tried adapting user438383's answer to my previous question but it threw an error (I'm quite new to R so I wasn't sure what this meant!):
# DUMMY DATA TEST
s.uid <- c(1:5, 9, NA, 5, 9)
v.uid <- c(3, 9, 8, 5, 2, 2, 7, 9, 5)
dat <- tibble(s.uid, v.uid)
dat %>%
group_by(s.uid) %>%
mutate(s.in.v = cumsum(dat$v.uid %in% s.uid)) %>%
group_by(v.uid) %>%
mutate(v.in.s = cumsum(dat$s.uid %in% v.uid))
Error: Problem with `mutate()` input `s.in.v`.
x Input `s.in.v` can't be recycled to size 1.
ℹ Input `s.in.v` is `cumsum(dat$v.uid %in% s.uid)`.
ℹ Input `s.in.v` must be size 1, not 9.
ℹ The error occurred in group 1: s.uid = 1.
One approach is to use the magrittr pipe placeholder dot, and a rowwise
approach, summing the number of TRUE
values of a subset of the column.
dat %>%
mutate(n = row_number()) %>%
rowwise() %>%
mutate(s.in.v = ifelse(is.na(s.uid), NA, sum(s.uid == .$v.uid[1:n], na.rm = T)),
v.in.s = ifelse(is.na(v.uid), NA, sum(v.uid == .$s.uid[1:n], na.rm = T))) %>%
ungroup() %>%
select(-n)
# A tibble: 9 x 5
n s.uid v.uid s.in.v v.in.s
<int> <dbl> <dbl> <int> <int>
1 1 1 3 0 0
2 2 2 9 0 0
3 3 3 8 1 0
4 4 4 5 0 0
5 5 5 2 1 1
6 6 9 2 1 1
7 7 NA 7 NA 0
8 8 5 9 1 1
9 9 9 5 2 2
Note that this is likely not computationally efficient.