Cumulative sum until maximum reached, then repeat from zero in the next row
I feel like this is a fairly easy question, but for the life of me I can't seem to find the answer. I have a fairly standard data frame, and what I am trying to do is sum a column of values until the running total reaches some value (either that exact value or greater), at which point a 1 goes into a new column (labelled keep) and the summing restarts at 0.
I have a column of minutes, the differences between consecutive minutes values, a keep column, and a cumulative sum column (the example I am using is much cleaner than the actual full dataset):
minutes difference keep difference_sum
1052991158 0 0 0
1052991338 180 0 180
1052991518 180 0 360
1052991698 180 0 540
1052991878 180 0 720
1052992058 180 0 900
1052992238 180 0 1080
1052992418 180 0 1260
1052992598 180 0 1440
1052992778 180 0 1620
1052992958 180 0 1800
The difference sum column was calculated with the code
caribou.sub$difference_sum <- cumsum(caribou.sub$difference)
What I would like to do is run the above code with the condition that, when the summed value reaches 1470 or more, it puts a 1 in the keep column and restarts the summing on the next row, continuing like this through the whole dataset.
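In case it helps anyone reproduce this, the example table can be rebuilt roughly as follows. This is only a sketch of the toy data shown above, not the real caribou.sub, which is messier:

```r
# Rebuild the example table; this caribou.sub is only the toy data from the question
minutes <- seq(1052991158, by = 180, length.out = 11)
caribou.sub <- data.frame(
  minutes = minutes,
  difference = c(0, diff(minutes)),  # first row has no previous value, so 0
  keep = 0
)
# Plain cumulative sum, without any restart yet
caribou.sub$difference_sum <- cumsum(caribou.sub$difference)
```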
Thanks in advance, and if you need any more information let me know.
Ayden
I think this is best done with a for loop; I can't think of a function that does this out of the box. The following should do what you want (if I understand you correctly).
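current.sum <- 0
for (i in seq_len(nrow(caribou.sub))) {
  # add this row's difference to the running total
  current.sum <- current.sum + caribou.sub[i, "difference"]
  caribou.sub[i, "difference_sum"] <- current.sum
  # threshold reached: flag the row and restart from zero on the next one
  if (current.sum >= 1470) {
    caribou.sub[i, "keep"] <- 1
    current.sum <- 0
  }
}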
Feel free to comment if it does not do exactly what you want. But as pointed out by alexwhan, your description is not completely clear.
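For what it's worth, the same running total can probably also be written without an explicit loop, using base R's Reduce() with accumulate = TRUE. This is only a sketch: the data frame below is a made-up stand-in for caribou.sub built from the differences in the question, and threshold is a name introduced here for the 1470 cutoff.

```r
# Hypothetical stand-in for caribou.sub, using the differences from the question
caribou.sub <- data.frame(difference = c(0, rep(180, 10)))
threshold <- 1470  # assumed cutoff, taken from the question

# Running total: once the previous total has reached the threshold,
# start over from the current value instead of adding to it
running <- Reduce(
  function(acc, x) if (acc >= threshold) x else acc + x,
  caribou.sub$difference,
  accumulate = TRUE
)

caribou.sub$difference_sum <- running
caribou.sub$keep <- as.integer(running >= threshold)
```

On this toy input the flag lands on row 10 (running total 1620) and row 11 starts again at 180, which is what the loop above produces as well.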
Assuming your data.frame is df:
df$difference_sum <- c(0, head(cumsum(df$difference), -1))
# number of rows before the lagged sum first reaches 1470,
# i.e. the length of the first block (the rows with keep == 0)
len <- sum(df$difference_sum %/% 1470 == 0)
df$keep <- (seq_len(nrow(df))-1) %/% len
df <- transform(df, difference_sum = ave(difference, keep,
                FUN = function(x) c(0, head(cumsum(x), -1))))
# minutes difference keep difference_sum
# 1 1052991158 180 0 0
# 2 1052991338 180 0 180
# 3 1052991518 180 0 360
# 4 1052991698 180 0 540
# 5 1052991878 180 0 720
# 6 1052992058 180 0 900
# 7 1052992238 180 0 1080
# 8 1052992418 180 0 1260
# 9 1052992598 180 0 1440
# 10 1052992778 180 1 0
# 11 1052992958 180 1 180
I still don't understand when the sum should restart and whether it should be zero at that point. A desired result would help greatly.
Nonetheless, I can't help but think that simple indexing and subtraction would be a straightforward way of doing this. The code below gives the same result as @Henrik's solution.
df$difference_sum <- cumsum(df$difference)
step <- (df$difference_sum %/% 1470) + 1
k <- which(diff(step) > 0) + 1
df$keep <- 0
df$keep[k] <- 1
step[k] <- step[k] - 1
df$difference_sum <- df$difference_sum - c(0, df$difference_sum[k])[step]
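Since the exact restart rule is still in question, it may help to check where the boundaries fall on a longer input with more than one reset. The sketch below is not from any of the answers: cumsum_reset is a hypothetical helper that spells out the "reach 1470, flag, restart from zero" rule, and the 20 rows of 180 are made-up data. It also shows where the boundaries land if they are taken from multiples of 1470 on the unreset cumulative sum, as the %/% step above does.

```r
# Hypothetical helper (not from the thread): explicit running sum with reset
cumsum_reset <- function(x, threshold) {
  total <- 0
  out <- numeric(length(x))
  keep <- integer(length(x))
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total
    if (total >= threshold) {
      keep[i] <- 1L   # flag the row where the threshold is reached
      total <- 0      # next row starts again from zero
    }
  }
  data.frame(difference = x, keep = keep, difference_sum = out)
}

# Made-up longer input so that more than one reset occurs
x <- rep(180, 20)
ref <- cumsum_reset(x, threshold = 1470)
which(ref$keep == 1)
# 9 18

# Boundaries taken from multiples of 1470 on the unreset cumulative sum
cs <- cumsum(x)
which(diff(cs %/% 1470) > 0) + 1
# 9 17
```

The first boundary agrees, but the second does not, so which version is right depends on whether the restart should be measured from the flagged row or from fixed multiples of 1470.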