Filtering out duplicated/non-unique rows in data.table

Edit 2019: This question was asked prior to changes in data.table in November 2016, see the accepted answer below for both the current and previous methods.

I have a data.table table with about 2.5 million rows. There are two columns. I want to remove any rows that are duplicated in both columns. Previously for a data.frame I would have done this: df -> unique(df[,c('V1', 'V2')]) but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE]) but it seems to still only operate on the key of the data.table and not the whole row.

Any suggestions?

Cheers, Davy

Example

>dt
      V1   V2
[1,]  A    B
[2,]  A    C
[3,]  A    D
[4,]  A    B
[5,]  B    A
[6,]  C    D
[7,]  C    D
[8,]  E    F
[9,]  G    G
[10,] A    B

in the above data.table where V2 is the table key, only rows 4,7, and 10 would be removed.

> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", 
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", 
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")

For v1.9.8+ (released November 2016)

From ?unique.data.table By default all columns are being used (which is consistent with ?unique.data.frame)

unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G

Or using the by argument in order to get unique combinations of specific columns (like previously keys were used for)

unique(dt, by = "V2")
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  E  F
6:  G  G

Prior v1.9.8

From ?unique.data.table, it is clear that calling unique on a data table only works on the key. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
     V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G

With your example data.table...

> dt<-data.table(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"))
> setkey(dt,V2)

Consider the following tests:

> haskey(dt) # obviously dt has a key, since we just set it
[1] TRUE

> haskey(dt[,list(V1,V2)]) # ... but this is treated like a "new" table, and does not have a key
[1] FALSE

> haskey(dt[,.SD]) # note that this still has a key
[1] TRUE

So, you can list the columns of the table and then take the unique() of that, with no need for setting the key to all columns or dropping it (by setting it to NULL) as required by the solution from @Andrie (and edited by @MatthewDowle). The solutions suggested by @Pop and @Rahul didn't work for me.

See Try 3 below, which is very similar to your initial try. Your example was not clear so I'm not sure why it didn't work. Also it was a few months ago when you posted the question, so maybe data.table was updated?

> unique(dt) # Try 1: wrong answer (missing V1=C and V2=D)
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G

> dt[!duplicated(dt)] # Try 2: wrong answer (missing V1=C and V2=D)
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G

> unique(dt[,list(V1,V2)]) # Try 3: correct answer; does not require modifying key
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G

> setkey(dt,NULL)
> unique(dt) # Try 4: correct answer; requires key to be removed
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G

unique(df) works on your example.

This should work for you

dt <- unique(dt, by = c('V1', 'V2'))

Filtering out duplicated/non-unique rows in data.table

Related

Recent Posts