How to delete duplicate entries?
I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.
What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?
Solution 1:
Some of these approaches seem a little complicated, and I generally do this as:
Given table table
, want to unique it on (field1, field2) keeping the row with the max field3:
DELETE FROM table USING table alias
WHERE table.field1 = alias.field1 AND table.field2 = alias.field2 AND
table.max_field < alias.max_field
For example, I have a table, user_accounts
, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).
DELETE FROM user_accounts USING user_accounts ua2
WHERE user_accounts.email = ua2.email AND user_account.id < ua2.id;
- Note -
USING
is not standard SQL, it is a PostgreSQL extension (but a very useful one), but the original question specifically mentions PostgreSQL.
Solution 2:
For example you could:
CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;