Delete "almost duplicate" rows based on coordinate proximity

I have a table with names of towns, municipality, county and coordinates. Unfortunately, some towns have "near duplicates", i.e. there exists another row with the same name, municipality, county and coordinates very near the first row. Some even have more than one "near duplicate".

How can I remove all but one of these kinds of rows?

I know I can use a CTE with ROW_NUMBER() OVER(PARTITION BY name, municipality, county, latitude, longitude) to delete exact duplicates. But how can I check if the latitude and longitude are, say, within 0.005 (=roughly 500-600 meters with lat/lng in decimal degrees) of each other?

Sample data

ID  Name         Municipality   County  Lat         Lng
------------------------------------------------------------
1   Springfield  Simpsonville   Homer   12.34567    89.01234
2   Springfield  Simpsonville   Homer   12.35000    89.01200
3   Springfield  Simpsonville   Homer   12.00000    89.00000
4   AnotherTown  AnotherVille   Bart    12.34567    89.01234

Since 2 has the same name, municipality and county as 1, and is within 0.005 decimal degrees in both latitude and longitude of 1, it is considered a duplicate and should be deleted. 3 on the other hand is not within 0.005 decimal degrees of 1, so it should not be considered a duplicate of 1.

Expected result

ID  Name         Municipality   County  Lat         Lng
--------------------------------------------------------
1   Springfield  Simpsonville   Homer   12.34567    89.01234
3   Springfield  Simpsonville   Homer   12.00000    89.00000
4   AnotherTown  AnotherVille   Bart    12.34567    89.01234

Solution 1:

You can use EXISTS in the DELETE statement:

DELETE FROM tablename AS t1
WHERE EXISTS (
  SELECT 1 
  FROM tablename t2
  WHERE (t2.Name, t2.Municipality, t2.County) = (t1.Name, t1.Municipality, t1.County)
    AND ABS(t2.Lat - t1.Lat) <= 0.005 AND ABS(t2.Lng - t1.Lng) <= 0.005
    AND t2.ID < t1.ID
);

See the demo.

Delete "almost duplicate" rows based on coordinate proximity

Sample data

Expected result

Solution 1:

Related

Recent Posts