Delete "almost duplicate" rows based on coordinate proximity
I have a table with names of towns, municipality, county and coordinates. Unfortunately, some towns have "near duplicates", i.e. there exists another row with the same name, municipality, county and coordinates very near the first row. Some even have more than one "near duplicate".
How can I remove all but one of these kinds of rows?
I know I can use a CTE with ROW_NUMBER() OVER(PARTITION BY name, municipality, county, latitude, longitude)
to delete exact duplicates. But how can I check if the latitude and longitude are, say, within 0.005
(=roughly 500-600 meters with lat/lng in decimal degrees) of each other?
Sample data
ID Name Municipality County Lat Lng
------------------------------------------------------------
1 Springfield Simpsonville Homer 12.34567 89.01234
2 Springfield Simpsonville Homer 12.35000 89.01200
3 Springfield Simpsonville Homer 12.00000 89.00000
4 AnotherTown AnotherVille Bart 12.34567 89.01234
Since 2
has the same name, municipality and county as 1
, and is within 0.005
decimal degrees in both latitude and longitude of 1
, it is considered a duplicate and should be deleted.
3
on the other hand is not within 0.005
decimal degrees of 1
, so it should not be considered a duplicate of 1
.
Expected result
ID Name Municipality County Lat Lng
--------------------------------------------------------
1 Springfield Simpsonville Homer 12.34567 89.01234
3 Springfield Simpsonville Homer 12.00000 89.00000
4 AnotherTown AnotherVille Bart 12.34567 89.01234
Solution 1:
You can use EXISTS
in the DELETE
statement:
DELETE FROM tablename AS t1
WHERE EXISTS (
SELECT 1
FROM tablename t2
WHERE (t2.Name, t2.Municipality, t2.County) = (t1.Name, t1.Municipality, t1.County)
AND ABS(t2.Lat - t1.Lat) <= 0.005 AND ABS(t2.Lng - t1.Lng) <= 0.005
AND t2.ID < t1.ID
);
See the demo.