How to delete duplicates on a MySQL table?

I need to DELETE duplicated rows for specified sid on a MySQL table.

How can I do this with an SQL query?

DELETE (DUPLICATED TITLES) FROM table WHERE SID = "1"

Something like this, but I don't know how to do it.

This removes duplicates in place, without making a new table.

ALTER IGNORE TABLE `table_name` ADD UNIQUE (title, SID)

Note: This only works well if index fits in memory.

Suppose you have a table employee, with the following columns:

employee (first_name, last_name, start_date)

In order to delete the rows with a duplicate first_name column:

delete
from employee using employee,
    employee e1
where employee.id > e1.id
    and employee.first_name = e1.first_name

Deleting duplicate rows in MySQL in-place, (Assuming you have a timestamp col to sort by) walkthrough:

Create the table and insert some rows:

create table penguins(foo int, bar varchar(15), baz datetime);
insert into penguins values(1, 'skipper', now());
insert into penguins values(1, 'skipper', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(4, 'rico', now());
select * from penguins;
    +------+----------+---------------------+
    | foo  | bar      | baz                 |
    +------+----------+---------------------+
    |    1 | skipper  | 2014-08-25 14:21:54 |
    |    1 | skipper  | 2014-08-25 14:21:59 |
    |    3 | kowalski | 2014-08-25 14:22:09 |
    |    3 | kowalski | 2014-08-25 14:22:13 |
    |    3 | kowalski | 2014-08-25 14:22:15 |
    |    4 | rico     | 2014-08-25 14:22:22 |
    +------+----------+---------------------+
6 rows in set (0.00 sec)

Remove the duplicates in place:

delete a
    from penguins a
    left join(
    select max(baz) maxtimestamp, foo, bar
    from penguins
    group by foo, bar) b
    on a.baz = maxtimestamp and
    a.foo = b.foo and
    a.bar = b.bar
    where b.maxtimestamp IS NULL;
Query OK, 3 rows affected (0.01 sec)
select * from penguins;
+------+----------+---------------------+
| foo  | bar      | baz                 |
+------+----------+---------------------+
|    1 | skipper  | 2014-08-25 14:21:59 |
|    3 | kowalski | 2014-08-25 14:22:15 |
|    4 | rico     | 2014-08-25 14:22:22 |
+------+----------+---------------------+
3 rows in set (0.00 sec)

You're done, duplicate rows are removed, last one by timestamp is kept.

For those of you without a timestamp or unique column.

You don't have a timestamp or a unique index column to sort by? You're living in a state of degeneracy. You'll have to do additional steps to delete duplicate rows.

create the penguins table and add some rows

create table penguins(foo int, bar varchar(15)); 
insert into penguins values(1, 'skipper'); 
insert into penguins values(1, 'skipper'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(4, 'rico'); 
select * from penguins; 
    # +------+----------+ 
    # | foo  | bar      | 
    # +------+----------+ 
    # |    1 | skipper  | 
    # |    1 | skipper  | 
    # |    3 | kowalski | 
    # |    3 | kowalski | 
    # |    3 | kowalski | 
    # |    4 | rico     | 
    # +------+----------+

make a clone of the first table and copy into it.

drop table if exists penguins_copy; 
create table penguins_copy as ( SELECT foo, bar FROM penguins );  

#add an autoincrementing primary key: 
ALTER TABLE penguins_copy ADD moo int AUTO_INCREMENT PRIMARY KEY first; 

select * from penguins_copy; 
    # +-----+------+----------+ 
    # | moo | foo  | bar      | 
    # +-----+------+----------+ 
    # |   1 |    1 | skipper  | 
    # |   2 |    1 | skipper  | 
    # |   3 |    3 | kowalski | 
    # |   4 |    3 | kowalski | 
    # |   5 |    3 | kowalski | 
    # |   6 |    4 | rico     | 
    # +-----+------+----------+

The max aggregate operates upon the new moo index:

delete a from penguins_copy a left join( 
    select max(moo) myindex, foo, bar 
    from penguins_copy 
    group by foo, bar) b 
    on a.moo = b.myindex and 
    a.foo = b.foo and 
    a.bar = b.bar 
    where b.myindex IS NULL; 

#drop the extra column on the copied table 
alter table penguins_copy drop moo; 
select * from penguins_copy; 

#drop the first table and put the copy table back: 
drop table penguins; 
create table penguins select * from penguins_copy;

observe and cleanup

drop table penguins_copy; 
select * from penguins;
+------+----------+ 
| foo  | bar      | 
+------+----------+ 
|    1 | skipper  | 
|    3 | kowalski | 
|    4 | rico     | 
+------+----------+ 
    Elapsed: 1458.359 milliseconds

What's that big SQL delete statement doing?

Table penguins with alias 'a' is left joined on a subset of table penguins called alias 'b'. The right hand table 'b' which is a subset finds the max timestamp [ or max moo ] grouped by columns foo and bar. This is matched to left hand table 'a'. (foo,bar,baz) on left has every row in the table. The right hand subset 'b' has a (maxtimestamp,foo,bar) which is matched to left only on the one that IS the max.

Every row that is not that max has value maxtimestamp of NULL. Filter down on those NULL rows and you have a set of all rows grouped by foo and bar that isn't the latest timestamp baz. Delete those ones.

Make a backup of the table before you run this.

Prevent this problem from ever happening again on this table:

If you got this to work, and it put out your "duplicate row" fire. Great. Now define a new composite unique key on your table (on those two columns) to prevent more duplicates from being added in the first place.

Like a good immune system, the bad rows shouldn't even be allowed in to the table at the time of insert. Later on all those programs adding duplicates will broadcast their protest, and when you fix them, this issue never comes up again.

Following remove duplicates for all SID-s, not only single one.

With temp table

CREATE TABLE table_temp AS
SELECT * FROM table GROUP BY title, SID;

DROP TABLE table;
RENAME TABLE table_temp TO table;

Since temp_table is freshly created it has no indexes. You'll need to recreate them after removing duplicates. You can check what indexes you have in the table with SHOW INDEXES IN table

Without temp table:

DELETE FROM `table` WHERE id IN (
  SELECT all_duplicates.id FROM (
    SELECT id FROM `table` WHERE (`title`, `SID`) IN (
      SELECT `title`, `SID` FROM `table` GROUP BY `title`, `SID` having count(*) > 1
    )
  ) AS all_duplicates 
  LEFT JOIN (
    SELECT id FROM `table` GROUP BY `title`, `SID` having count(*) > 1
  ) AS grouped_duplicates 
  ON all_duplicates.id = grouped_duplicates.id 
  WHERE grouped_duplicates.id IS NULL
)

How to delete duplicates on a MySQL table?

Deleting duplicate rows in MySQL in-place, (Assuming you have a timestamp col to sort by) walkthrough:

For those of you without a timestamp or unique column.

Related

Recent Posts