Group integer vector into consecutive runs
I have two vectors of integer. I would like to identify the intervals of consecutive integer sequences presented in the second vector conditioned by the first vector (this vector can be seen as a factor, by which the second vector can be classified into several groups).
Here I present a dummy for my problem.
The data, in one group (defined by the first vector) of the second vector, the integers monotonically increase.
my.data <- data.frame(
V1=c(rep(1, 10), rep(2, 9), rep(3,11)),
V2=c(seq(2,5), seq(7,11), 13, seq(4, 9), seq(11,13), seq(1, 6), seq(101, 105))
)
What I want:
- output the begin and end of the interval
- here, group in the first column, the beginning integer in the second, the end integer in the third.
Expected results:
1, 2, 5 \n
1, 7, 11 \n
1, 13, 13 \n
2, 4, 9 \n
2, 11, 13 \n
3, 1, 6 \n
3, 101, 105 \n
Here's a brief answer using aggregate....
runs <- cumsum( c(0, diff(my.data$V2) > 1) )
aggregate(V2 ~ runs + V1, my.data, range)[,-1]
V1 V2.1 V2.2
1 1 2 5
2 1 7 11
3 1 13 13
4 2 4 9
5 2 11 13
6 3 1 6
7 3 101 105
A while back, I wrote a variant of rle()
which I named seqle()
because it allows one to look for integer sequences rather than repetitions. Then, you can do:
Rgames: seqle(my.data[my.data$V1==1,2]) #repeat for my.data$V1 equal to 2 and 3
$lengths
[1] 4 5 1
$values
[1] 2 7 13
(for example). It would take a little fiddling to get these results into the tabular form you want, but just thought I'd mention it. BTW, here's the code for seqle
. If you set incr=0
you get the base rle result.
function(x,incr=1){
if(!is.numeric(x)) x <- as.numeric(x)
n <- length(x)
y <- x[-1L] != x[-n] + incr
i <- c(which(y|is.na(y)),n)
list( lengths = diff(c(0L,i)), values = x[head(c(0L,i)+1L,-1L)])
}
EDIT: There's an excellent upgrade to this, provided by flodel, at How to check if a vector contains n consecutive numbers . He pointed out that this version has the usual floating-point error problems when working with doubles, and provided a fix as well.
here is an example:
library(plyr)
ddply(my.data, .(V1),
function(x) data.frame(do.call("rbind", tapply(x$V2, cumsum(c(T, diff(x$V2)!=1)),
function(y) c(min(y), max(y))))))
maybe, too complicated, but what is important is the cumsum(c(T, diff(x$V2)!=1))
.
> ddply(my.data, .(V1),
+ function(x) data.frame(do.call("rbind", tapply(x$V2, cumsum(c(T, diff(x$V2)!=1)),
+ function(y) c(min(y), max(y))))))
V1 X1 X2
1 1 2 5
2 1 7 11
3 1 13 13
4 2 4 9
5 2 11 13
6 3 1 6
7 3 101 105
Here's a solution using ddply
from the plyr
package. The basic idea is to see when diff(x)
isn't 1, in order to find the changeover points.
ddply(
my.data,
.(V1),
summarise,
lower =
{
cut_points <- which(diff(V2) != 1)
V2[c(1, cut_points + 1)]
},
upper =
{
cut_points <- which(diff(V2) != 1)
V2[c(cut_points, length(V2))]
}
)
my.data$run <- ave(my.data$V2, my.data$V1, FUN=function(x) c(1, diff(x)))
strstp <- by(my.data, list(my.data$V1),
FUN=function(x) list(
starts=c( head(x$V2,1), x$V2[x$run != 1]),
stops=c(x$V2[which(x$run != 1)-1], tail(x$V2, 1))))
> strstp
: 1
$starts
[1] 2 7 13
$stops
[1] 5 11 13
-------------------------------------------------------------
: 2
$starts
[1] 4 11
$stops
[1] 9 13
-------------------------------------------------------------
: 3
$starts
[1] 1 101
$stops
[1] 6 105