Estimating distance between two servers by IP address
I was reading the GFS (Google File System) paper, and there's a statement:
Our network topology is simple enough that "distances" can be accurately estimated from IP addresses
I was curious about how this might actually work in practice and would appreciate it if anyone had any ideas.
That's called an IP numbering plan. It is similar to how POTS telephone numbers are not assigned completely at random but are built from a country code prefix, an area code, and a local subscriber number.
You can then infer similar information: the same area code means the same DC; an area code that differs by one or two is still nearby, or possibly a different quadrant of the same DC; and a large offset in area codes, or a different country code, is almost certainly long distance.
That is much more than the only assumption that would otherwise be safe without consulting routing tables, namely that everything within the same subnet must be nearby.
For instance, when IP address ranges are assigned on an as-needed basis, you could find that 10.1.2.0/24 ends up in Ireland, 10.1.3.0/24 in Hong Kong, and 10.1.4.0/24 in Los Angeles, with 10.1.5.0/24 nearby in San Francisco and 10.1.6.0/24 taking you all the way to Sydney.
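By contrast, under a structured numbering plan the address itself can be compared directly. The following is only a minimal sketch, assuming a hypothetical plan of the form 10.<region>.<dc>.<host>, where the second octet plays the role of the country code and the third the role of the area code. The layout, the threshold of two, and the name plan_distance are assumptions made for illustration, not anything described in the GFS paper.

```python
import ipaddress


def plan_distance(a: str, b: str) -> str:
    """Classify the distance between two IPv4 addresses.

    Assumes a hypothetical numbering plan 10.<region>.<dc>.<host>;
    a real plan would need its own rules.
    """
    octets_a = ipaddress.IPv4Address(a).packed  # 4 bytes, one per octet
    octets_b = ipaddress.IPv4Address(b).packed

    # Different region octet: the "different country code" case.
    if octets_a[1] != octets_b[1]:
        return "long distance"

    # Same region: compare the datacenter octet (the "area code").
    gap = abs(octets_a[2] - octets_b[2])
    if gap == 0:
        return "same datacenter"
    if gap <= 2:  # assumed threshold for "still nearby"
        return "nearby"
    return "long distance"


if __name__ == "__main__":
    print(plan_distance("10.1.2.5", "10.1.2.200"))
    print(plan_distance("10.1.2.5", "10.1.3.9"))
    print(plan_distance("10.1.2.5", "10.1.40.9"))
```

None of this works with the as-needed allocation in the example above, where adjacent /24s land on different continents.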
It looks as though the paper was referring to an addressing scheme the organization itself controls, so they know exactly where specific IP blocks are being used. They probably have something akin to a list of IP blocks that correspond to a given datacenter, and can therefore map the source and destination IP addresses to netblocks, datacenters, and physical addresses or GPS coordinates.
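A list of that kind could be as simple as a table keyed by netblock. The sketch below is an assumption about how such a lookup might look, using Python's standard ipaddress module; the blocks, the site names, and the helpers site_for and same_site are all hypothetical.

```python
import ipaddress
from typing import Optional

# Hypothetical netblock-to-site table; a real one would come from
# the organization's own address management records.
SITES = {
    ipaddress.ip_network("10.1.0.0/16"): "dc-east",
    ipaddress.ip_network("10.2.0.0/16"): "dc-west",
    ipaddress.ip_network("10.2.8.0/24"): "dc-west-annex",
}


def site_for(ip: str) -> Optional[str]:
    """Return the site owning the most specific block containing ip."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in SITES if addr in net]
    if not matches:
        return None  # address is outside every known block
    # Longest prefix wins, as in a routing table lookup.
    return SITES[max(matches, key=lambda net: net.prefixlen)]


def same_site(a: str, b: str) -> bool:
    """True only if both addresses resolve to the same known site."""
    site_a, site_b = site_for(a), site_for(b)
    return site_a is not None and site_a == site_b


if __name__ == "__main__":
    print(site_for("10.2.8.7"))
    print(same_site("10.1.0.1", "10.1.200.3"))
```

From the site name you could then look up a physical address or coordinates in a second table.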
You could also do something similar with any IP address. There are several Geo-IP services that can geolocate a given IP address to a city/state, zip code, or GPS coordinates. Obviously there is a margin of error, since those services have no way to know exactly where an IP address is in use, but for the most part they'll get you close. If you're trying to calculate the distance between a server in California and one in NYC, the error margin won't amount to much relative to that distance, but if you're trying to calculate the distance between an IP in North Florida and another in Southern Georgia, your result probably won't be nearly as accurate.
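To see why the error matters more over short distances, here is a rough sketch that computes the great-circle distance with the haversine formula and then widens it by each lookup's error radius. It assumes you already have coordinates and some error radius in kilometres for each address from whatever Geo-IP service you use; the function names and the sample coordinates (approximate city centres) are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_bounds_km(distance: float, err_a: float, err_b: float):
    """Lowest and highest plausible distance given each point's error radius."""
    slack = err_a + err_b
    return max(0.0, distance - slack), distance + slack


if __name__ == "__main__":
    # Approximate coordinates for Los Angeles and New York City.
    d = haversine_km(34.05, -118.24, 40.71, -74.01)
    # Assumed 50 km error radius on each lookup.
    print(round(d), distance_bounds_km(d, 50, 50))
```

With an assumed 50 km error on each end, a cross-country estimate is off by a few percent at most, while two points 100 km apart could in reality be anywhere from 0 to 200 km apart.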
Another pitfall of the public Geo-IP services is that they have no idea what the internal network structure looks like. For instance, if an organization headquartered in Chicago backhauls all of its branch office traffic over its private network, you might see physical locations in Arizona with what appears to be a Chicago IP.