How to delete duplicate rows without unique identifier
I like @erwin-brandstetter 's solution, but wanted to show a solution with the USING
keyword:
DELETE FROM table_with_dups T1
USING table_with_dups T2
WHERE T1.ctid < T2.ctid -- delete the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
If you want to review the records before deleting them, then simply replace DELETE
with SELECT *
and USING
with a comma ,
, i.e.
SELECT * FROM table_with_dups T1
, table_with_dups T2
WHERE T1.ctid < T2.ctid -- select the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...)
clause as those generate a lot of rows in the subquery.
If you rewrite the query to use IN (...)
then it performs similarly to the solution presented here, but the SQL code becomes much less concise.
Update 2: If you have NULL
values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE()
in the condition for that column, e.g.
AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
If you have no other unique identifier, you can use ctid
:
delete from mytable
where exists (select 1
from mytable t2
where t2.name = mytable.name and
t2.address = mytable.address and
t2.zip = mytable.zip and
t2.ctid > mytable.ctid
);
It is a good idea to have a unique, auto-incrementing id in every table. Doing a delete
like this is one important reason why.
In a perfect world, every table has a unique identifier of some sort.
In the absence of any unique column (or combination thereof), use the ctid
column:
- In-order sequence generation
- How do I decompose ctid into page and row numbers?
DELETE FROM tbl
WHERE ctid NOT IN (
SELECT min(ctid) -- ctid is NOT NULL by definition
FROM tbl
GROUP BY name, address, zipcode); -- list columns defining duplicates
The above query is short, conveniently listing column names only once. NOT IN (SELECT ...)
is a tricky query style when NULL values can be involved, but the system column ctid
is never NULL. See:
- Find records where join doesn't exist
Using EXISTS
as demonstrated by @Gordon is typically faster. So is a self-join with the USING
clause like @isapir added later. Both should result in the same query plan.
Important difference: These other queries treat NULL values as not equal, while GROUP BY
(or DISTINCT
or DISTINCT ON ()
) treats NULL values as equal. Does not matter for columns defined NOT NULL
. Else, depending on your definition of "duplicate", you'll need one approach or the other. Or use IS NOT DISTINCT FROM
to compare values (which may exclude some indexes).
Disclaimer:
ctid
is an implementation detail of Postgres, it's not in the SQL standard and can change between major versions without warning (even if that's very unlikely). Its values can change between commands due to background processes or concurrent write operations (but not within the same command).
Related:
-
How do I (or can I) SELECT DISTINCT on multiple columns?
-
How to use the physical location of rows (ROWID) in a DELETE statement
Aside:
The target of a DELETE
statement cannot be the CTE, only the underlying table. That's a spillover from SQL Server - as is your whole approach.
Here is what I came up with, using a group by
DELETE FROM mytable
WHERE id NOT in (
SELECT MIN(id)
FROM mytable
GROUP BY name, address, zipcode
)
It deletes the duplicates, preserving the oldest record that has duplicates.