Removing duplicate rows (based on values from multiple columns) from SQL table

I have following SQL table:

AR_Customer_ShipTo

+--------------+------------+-------------------+------------+
| ARDivisionNo | CustomerNo |   CustomerName    | ShipToCode |
+--------------+------------+-------------------+------------+
|           00 | 1234567    | Test Customer     |          1 |
|           00 | 1234567    | Test Customer     |          2 |
|           00 | 1234567    | Test Customer     |          3 |
|           00 | ARACODE    | ARACODE Customer  |          1 |
|           00 | ARACODE    | ARACODE Customer  |          2 |
|           01 | CBE1EX     | Normal Customer   |          1 |
|           02 | ZOCDOC     | Normal Customer-2 |          1 |
+--------------+------------+-------------------+------------+

(ARDivisionNo, CustomerNo,ShipToCode) form a primary key for this table.

If you notice first 3 rows belong to same customer (Test Customer), who has different ShipToCodes: 1, 2 and 3. Similar is the case with second customer (ARACODE Customer). Each of Normal Customer and Normal Customer-2 has only 1 record with a single ShipToCode.

Now, I would like to get result querying on this table, where I will have only 1 record per customer. So, for any customer, where there are more than 1 records, I would like to keep the record with highest value for ShipToCode.

I tried various things:

(1) I can easily get the list of customers with only one record in table.

(2) With following query, I am able to get the list of all the customers, who have more than one record in the table.

[Query-1]

SELECT ARDivisionNo, CustomerNo
FROM AR_Customer_ShipTo 
GROUP BY ARDivisionNo, CustomerNo
HAVING COUNT(*) > 1;

(3) Now, in order to select proper ShipToCode for each record returned by above query, I am not able to figure out, how to iterate through all the records returned by above query.

If I do something like:

[Query-2]

SELECT TOP 1 ARDivisionNo, CustomerNo, CustomerName, ShipToCode  
FROM AR_Customer_ShipTo 
WHERE ARDivisionNo = '00' and CustomerNo = '1234567'
ORDER BY ShipToCode DESC

Then I can get the appropriate record for (00-1234567-Test Customer). Hence, if I can use all the results from query-1 in the above query (query-2), then I can get the desired single records for customers with more than one record. This can be combined with results from point (1) to achieve the desired end result.

Again, this can be easier than approach I am following. Please let me know how can I do this.

[Note: I have to do this using SQL queries only. I cannot use stored procedures, as I am going to execute this thing finally using 'Scribe Insight', which only allows me to write queries.]

Sample SQL FIDDLE

1) Use CTE to get max ship code value record based on ARDivisionNo, CustomerNo for each Customers

WITH cte AS (
  SELECT*, 
     row_number() OVER(PARTITION BY ARDivisionNo, CustomerNo ORDER BY ShipToCode desc) AS [rn]
  FROM t
)
Select * from cte WHERE [rn] = 1

2) To Delete the record use Delete query instead of Select and change Where Clause to rn > 1. Sample SQL FIDDLE

WITH cte AS (
  SELECT*, 
     row_number() OVER(PARTITION BY ARDivisionNo, CustomerNo ORDER BY ShipToCode desc) AS [rn]
  FROM t
)
Delete from cte WHERE [rn] > 1;

select * from t;

ROW_NUMBER() is great for this:

;WITH cte AS (SELECT *,ROW_NUMBER() OVER(PARTITION BY ARDivisionNo,CustomerNo ORDER BY ShipToCode DESC) AS RN 
              FROM AR_Customer_ShipTo
              )
SELECT * 
FROM  cte
WHERE RN = 1

You mention removing the duplicates, if you want to DELETE you can simply:

;WITH cte AS (SELECT *,ROW_NUMBER() OVER(PARTITION BY ARDivisionNo,CustomerNo ORDER BY ShipToCode DESC) AS RN 
              FROM AR_Customer_ShipTo
              )
DELETE cte
WHERE RN > 1

The ROW_NUMBER() function assigns a number to each row. PARTITION BY is optional, but used to start the numbering over for each value in a given field or group of fields, ie: if you PARTITION BY Some_Date then for each unique date value the numbering would start over at 1. ORDER BY of course is used to define how the counting should go, and is required in the ROW_NUMBER() function.

You didn't specify the version of SQL Server, but ROW_NUMBER is probably supported:

select *
from
 (
  select ...
     ,row_number() 
      over (partition by ARDivisionNo, CustomerNo
            order by ShipToCode desc) as rn 
  from tab
 ) as dt
where rn = 1

With row_number function:

SELECT * FROM(
              SELECT ARDivisionNo, CustomerNo, CustomerName, ShipToCode,
              row_number() over(partition by CustomerNo order by ShipToCode desc) rn
              FROM AR_Customer_ShipTo) t
WHERE rn = 1

Removing duplicate rows (based on values from multiple columns) from SQL table

Related

Recent Posts