Remove/collapse consecutive duplicate values in sequence

I have a data frame whose first column contains the following sequence of values:

a a a b c c d e a a b b b e e d d

The required result should be

a b c d e a b e d 

That is, no two consecutive rows should have the same value. How can this be done without using a loop?

My data set is quite large, so looping takes a long time to execute.

The dataframe structure is like the following

a 1 
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10

Result:

a 1 
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4

It should delete the entire row.


Solution 1:

One easy way is to use rle:

Here's your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
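To make the two components concrete, here is a small sketch that inspects both of them for the sample vector and rebuilds the original with inverse.rle. The variable names r and roundtrip are just for illustration.

```r
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a",
       "a", "b", "b", "b", "e", "e", "d", "d")

r <- rle(x)
r$lengths   # length of each run: 3 1 2 1 1 2 3 2 2
r$values    # value repeated in each run

# inverse.rle() expands the runs back into the original vector
roundtrip <- inverse.rle(r)
identical(roundtrip, x)
```

The lengths always sum to length(x), which is what makes the cumsum trick for row indices possible.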

Update: For a data.frame

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE  # keep V1 as character; rle() does not accept factors
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1]  1  4  5  7  8  9 11 13 15
mydf[Y, ]
#    V1 V2
# 1   a  1
# 4   b  2
# 5   c  4
# 7   d  3
# 8   e  9
# 9   a  4
# 11  b 10
# 13  e  2
# 15  d  4
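The same run lengths can also give you the last row of each run, if that is the row you would rather keep. This is only a sketch of the idea: cumsum of the lengths marks where each run ends, and subtracting the length and adding one gets back to where it starts. The names first_rows and last_rows are made up for the example.

```r
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE
)

X <- rle(mydf$V1)
last_rows  <- cumsum(X$lengths)              # index where each run ends
first_rows <- last_rows - X$lengths + 1L     # index where each run starts

mydf[last_rows, ]   # last row of every run instead of the first
```

Here first_rows is the same vector as Y above, just derived from the run ends.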

Update 2

The "data.table" package has a function, rleid, that lets you do this quite easily. Using mydf from above, try the following (note that the result shows the run id in place of V1):

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
#    rleid V2
# 1:     1  1
# 2:     2  2
# 3:     3  4
# 4:     4  3
# 5:     5  9
# 6:     6  4
# 7:     7 10
# 8:     8  2
# 9:     9  4
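If you want to see what a run id looks like without installing anything, the following base R sketch builds an equivalent grouping vector by hand. It assumes a non-empty vector without NAs; run_id and keep are illustrative names, not part of data.table.

```r
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE
)

v <- mydf$V1
# TRUE wherever a new run starts; the running total numbers the runs 1, 2, 3, ...
run_id <- cumsum(c(TRUE, v[-1] != v[-length(v)]))
run_id

# keep the first row of every run; V1 stays in the result
keep <- !duplicated(run_id)
mydf[keep, ]
```

Having the run id as an ordinary vector is also handy if you later want a per-run summary rather than just the first row.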

Solution 2:

library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
# default fills the first position; use a string so it matches the type of x
x[x != lag(x, default = "")]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

EDIT: For data.frame

mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE)

The dplyr solution is a one-liner:

mydf %>% filter(V1 != lag(V1, default = "1"))
#  V1 V2
#1  a  1
#2  b  2
#3  c  4
#4  d  3
#5  e  9
#6  a  4
#7  b 10
#8  e  2
#9  d  4

post scriptum

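lead(x, 1), suggested by @Carl Witthoft, compares each value with the next one rather than the previous one, so it keeps the last row of each run instead of the first: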

leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]

#   V1  V2
#3   a   3
#4   b   2
#6   c   1
#7   d   3
#8   e   9
#10  a   8
#12  b 199
#14  e   5
#16  d  10

Solution 3:

With base R, I like funny algorithmics:

x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")

x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
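One thing to be aware of with this trick: c(x[-1], FALSE) coerces the FALSE to the string "FALSE", which then acts as a sentinel for the last position. That is fine for letters, but it silently drops the final run if the data happen to end in the literal string "FALSE". A sketch of the pitfall and of a variant that avoids any sentinel (the example vector and the names with_sentinel and keep_last are made up for illustration):

```r
x <- c("a", "a", "b", "FALSE", "FALSE")

# sentinel version: the last element is compared with "FALSE" and lost
with_sentinel <- x[x != c(x[-1], FALSE)]
with_sentinel

# sentinel-free version: compare neighbours, then always keep the last element
keep_last <- c(x[-length(x)] != x[-1], TRUE)
x[keep_last]
```

Like the original one-liner, the second version keeps the last element of each run, which makes no difference for a plain vector but matters when you use the mask to subset rows of a data frame.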

Solution 4:

Much as I like... errr, love rle, here's a shootoff:

EDIT: Can't figure out exactly what's up with dplyr, so I used dplyr::lead. I'm on OSX, R 3.1.2, and the latest dplyr from CRAN.

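library(dplyr)           # for lead()
library(microbenchmark)

xlet <- sample(letters, 1e5, replace = TRUE)
rleit <- function(x) rle(x)$values
lagit <- function(x) x[x != dplyr::lead(x, default = "")]
# keep each element that differs from its successor; the last element is always kept
tailit <- function(x) x[c(head(x, -1) != tail(x, -1), TRUE)]

microbenchmark(rleit(xlet), lagit(xlet), tailit(xlet), times = 20)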
Unit: milliseconds
         expr      min       lq   median       uq      max neval
  rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657    20
  lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940    20
 tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840    20
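Before trusting any timing comparison, it is worth checking that the approaches agree on the result. The sketch below does that in base R only, comparing rle against a neighbour-comparison mask on random data. The seed, the three-letter alphabet (chosen so that runs are frequent), and the names via_rle and via_mask are assumptions for the example.

```r
set.seed(1)  # arbitrary seed, only to make the check repeatable
xlet <- sample(letters[1:3], 1000, replace = TRUE)

via_rle  <- rle(xlet)$values
# TRUE at the first element of every run
via_mask <- xlet[c(TRUE, xlet[-1] != xlet[-length(xlet)])]

identical(via_rle, via_mask)

# no two neighbours in the collapsed vector should be equal
all(via_rle[-1] != via_rle[-length(via_rle)])
```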