How to select unique records by SQL
With the distinct
keyword with single and multiple column names, you get distinct records:
SELECT DISTINCT column 1, column 2, ...
FROM table_name;
There are 4 methods you can use:
- DISTINCT
- GROUP BY
- Subquery
- Common Table Expression (CTE) with ROW_NUMBER()
Consider the following sample TABLE
with test data:
/** Create test table */
CREATE TEMPORARY TABLE dupes(word text, num int, id int);
/** Add test data with duplicates */
INSERT INTO dupes(word, num, id)
VALUES ('aaa', 100, 1)
,('bbb', 200, 2)
,('ccc', 300, 3)
,('bbb', 400, 4)
,('bbb', 200, 5) -- duplicate
,('ccc', 300, 6) -- duplicate
,('ddd', 400, 7)
,('bbb', 400, 8) -- duplicate
,('aaa', 100, 9) -- duplicate
,('ccc', 300, 10); -- duplicate
Option 1: SELECT DISTINCT
This is the most simple and straight forward, but also the most limited way:
SELECT DISTINCT word, num
FROM dupes
ORDER BY word, num;
/*
word|num|
----|---|
aaa |100|
bbb |200|
bbb |400|
ccc |300|
ddd |400|
*/
Option 2: GROUP BY
Grouping allows you to add aggregated data, like the min(id)
, max(id)
, count(*)
, etc:
SELECT word, num, min(id), max(id), count(*)
FROM dupes
GROUP BY word, num
ORDER BY word, num;
/*
word|num|min|max|count|
----|---|---|---|-----|
aaa |100| 1| 9| 2|
bbb |200| 2| 5| 2|
bbb |400| 4| 8| 2|
ccc |300| 3| 10| 3|
ddd |400| 7| 7| 1|
*/
Option 3: Subquery
Using a subquery, you can first identify the duplicate rows to ignore, and then filter them out in the outer query with the WHERE NOT IN (subquery)
construct:
/** Find the higher id values of duplicates, distinct only added for clarity */
SELECT distinct d2.id
FROM dupes d1
INNER JOIN dupes d2 ON d2.word=d1.word AND d2.num=d1.num
WHERE d2.id > d1.id
/*
id|
--|
5|
6|
8|
9|
10|
*/
/** Use the previous query in a subquery to exclude the dupliates with higher id values */
SELECT *
FROM dupes
WHERE id NOT IN (
SELECT d2.id
FROM dupes d1
INNER JOIN dupes d2 ON d2.word=d1.word AND d2.num=d1.num
WHERE d2.id > d1.id
)
ORDER BY word, num;
/*
word|num|id|
----|---|--|
aaa |100| 1|
bbb |200| 2|
bbb |400| 4|
ccc |300| 3|
ddd |400| 7|
*/
Option 4: Common Table Expression with ROW_NUMBER()
In the Common Table Expression (CTE), select the ROW_NUMBER(), partitioned by the group column and ordered in the desired order. Then SELECT only the records that have ROW_NUMBER() = 1
:
WITH CTE AS (
SELECT *
,row_number() OVER(PARTITION BY word, num ORDER BY id) AS row_num
FROM dupes
)
SELECT word, num, id
FROM cte
WHERE row_num = 1
ORDER BY word, num;
/*
word|num|id|
----|---|--|
aaa |100| 1|
bbb |200| 2|
bbb |400| 4|
ccc |300| 3|
ddd |400| 7|
*/
If you only need to remove duplicates then use DISTINCT
. GROUP BY
should be used to apply aggregate operators to each group
GROUP BY v DISTINCT