Calculate distance between Zip Codes... AND users.
Solution 1:
Ok, for starters, you don't really need to use the Haversine formula here. For large distances where a less accurate formula produces a larger error, your users don't care if the match is plus or minus a few miles, and for closer distances, the error is very small. There are easier (to calculate) formulas listed on the Geographical Distance Wikipedia article.
Since zip codes are nothing like evenly spaced, any process that partitions them evenly is going to suffer mightily in areas where they are clustered tightly (east coast near DC being a good example). If you want a visual comparison, check out http://benfry.com/zipdecode and compare the zipcode prefix 89 with 07.
A far better way to deal with indexing this space is to use a data structure like a Quadtree or an R-tree. This structure allows you to do spatial and distance searches over data which is not evenly spaced.
Here's what an Quadtree looks like:
To search over it, you drill down through each larger cell using the index of smaller cells that are within it. Wikipedia explains it more thoroughly.
Of course, since this is a fairly common thing to do, someone else has already done the hard part for you. Since you haven't specified what database you're using, the PostgreSQL extension PostGIS will serve as an example. PostGIS includes the ability to do R-tree spatial indexes which allow you to do efficient spatial querying.
Once you've imported your data and built the spatial index, querying for distance is a query like:
SELECT zip
FROM zipcode
WHERE
geom && expand(transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661), 16093)
AND
distance(
transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661),
geom) < 16093
I'll let you work through the rest of the tutorial yourself.
- http://unserializableone.blogspot.com/2007/02/using-postgis-to-find-points-of.html
Here are some other references to get you started.
- http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_tut02
- http://www.manning.com/obe/PostGIS_MEAPCH01.pdf
- http://postgis.refractions.net/docs/ch04.html
Solution 2:
I'd simply just create a zip_code_distances table and pre-compute the distances between all 42K zipcodes in the US which are within a 20-25 mile radius of each other.
create table zip_code_distances
(
from_zip_code mediumint not null,
to_zip_code mediumint not null,
distance decimal(6,2) default 0.0,
primary key (from_zip_code, to_zip_code),
key (to_zip_code)
)
engine=innodb;
Only including zipcodes within a 20-25 miles radius of each other reduces the number of rows you need to store in the distance table from it's maximum of 1.7 billion (42K ^ 2) - 42K to a much more manageable 4 million or so.
I downloaded a zipcode datafile from the web which contained the longitudes and latitudes of all the official US zipcodes in csv format:
"00601","Adjuntas","Adjuntas","Puerto Rico","PR","787","Atlantic", 18.166, -66.7236
"00602","Aguada","Aguada","Puerto Rico","PR","787","Atlantic", 18.383, -67.1866
...
"91210","Glendale","Los Angeles","California","CA","818","Pacific", 34.1419, -118.261
"91214","La Crescenta","Los Angeles","California","CA","818","Pacific", 34.2325, -118.246
"91221","Glendale","Los Angeles","California","CA","818","Pacific", 34.1653, -118.289
...
I wrote a quick and dirty C# program to read the file and compute the distances between every zipcode but only output zipcodes that fall within a 25 mile radius:
sw = new StreamWriter(path);
foreach (ZipCode fromZip in zips){
foreach (ZipCode toZip in zips)
{
if (toZip.ZipArea == fromZip.ZipArea) continue;
double dist = ZipCode.GetDistance(fromZip, toZip);
if (dist > 25) continue;
string s = string.Format("{0}|{1}|{2}", fromZip.ZipArea, toZip.ZipArea, dist);
sw.WriteLine(s);
}
}
The resultant output file looks as follows:
from_zip_code|to_zip_code|distance
...
00601|00606|16.7042215574185
00601|00611|9.70353520976393
00601|00612|21.0815707704904
00601|00613|21.1780461311929
00601|00614|20.101431539283
...
91210|90001|11.6815708119899
91210|90002|13.3915723402714
91210|90003|12.371251171873
91210|90004|5.26634939906721
91210|90005|6.56649623829871
...
I would then just load this distance data into my zip_code_distances table using load data infile and then use it to limit the search space of my application.
For example if you have a user whose zipcode is 91210 and they want to find people who are within a 10 mile radius of them then you can now simply do the following:
select
p.*
from
people p
inner join
(
select
to_zip_code
from
zip_code_distances
where
from_zip_code = 91210 and distance <= 10
) search
on p.zip_code = search.to_zip_code
where
p.gender = 'F'....
Hope this helps
EDIT: extended radius to 100 miles which increased the number of zipcode distances to 32.5 million rows.
quick performance check for zipcode 91210 runtime 0.009 seconds.
select count(*) from zip_code_distances
count(*)
========
32589820
select
to_zip_code
from
zip_code_distances
where
from_zip_code = 91210 and distance <= 10;
0:00:00.009: Query OK