DISTINCT for only one column

Let's say I have the following query.

SELECT ID, Email, ProductName, ProductModel FROM Products

How can I modify it so that it returns no duplicate Emails?

In other words, when several rows contain the same email, I want the results to include only one of those rows (preferably the last one). Duplicates in other columns should be allowed.

Clauses like DISTINCT and GROUP BY appear to work on entire rows. So I'm not sure how to approach this.


Solution 1:

If you are using SQL Server 2005 or above use this:

SELECT *
  FROM (
                SELECT  ID, 
                        Email, 
                        ProductName, 
                        ProductModel,
                        ROW_NUMBER() OVER(PARTITION BY Email ORDER BY ID DESC) rn
                    FROM Products
              ) a
WHERE rn = 1

EDIT: Example using a where clause:

SELECT *
  FROM (
                SELECT  ID, 
                        Email, 
                        ProductName, 
                        ProductModel,
                        ROW_NUMBER() OVER(PARTITION BY Email ORDER BY ID DESC) rn
                    FROM Products
                   WHERE ProductModel = 2
                     AND ProductName LIKE 'CYBER%'

              ) a
WHERE rn = 1

Solution 2:

This assumes SQL Server 2005+ and your definition of "last" is the max PK for a given email

WITH CTE AS
(
SELECT ID, 
       Email, 
       ProductName, 
       ProductModel, 
       ROW_NUMBER() OVER (PARTITION BY Email ORDER BY ID DESC) AS RowNumber 
FROM   Products
)
SELECT ID, 
       Email, 
       ProductName, 
       ProductModel
FROM CTE 
WHERE RowNumber = 1

Solution 3:

You can over that by using GROUP BY like this:

SELECT ID, Email, ProductName, ProductModel
FROM Products
GROUP BY Email

Solution 4:

When you use DISTINCT think of it as a distinct row, not column. It will return only rows where the columns do not match exactly the same.

SELECT DISTINCT ID, Email, ProductName, ProductModel
FROM Products

----------------------
1 | [email protected] | ProductName1 | ProductModel1
2 | [email protected] | ProductName1 | ProductModel1

The query would return both rows because the ID column is different. I'm assuming that the ID column is an IDENTITY column that is incrementing, if you want to return the last then I recommend something like this:

SELECT DISTINCT TOP 1 ID, Email, ProductName, ProductModel
FROM Products
ORDER BY ID DESC

The TOP 1 will return only the first record, by ordering it by the ID descending it will return the results with the last row first. This will give you the last record.