How to do n-D distance and nearest neighbor calculations on numpy arrays
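Throughout, the snippets assume X and Y are 2-D arrays with one point per row. A minimal setup for trying them out (the sizes and names below are just for illustration):

import numpy as np

rng = np.random.default_rng(0)
i, j, n = 1000, 800, 3      # points in X, points in Y, dimensions
X = rng.random((i, n))
Y = rng.random((j, n))
k = 5                       # number of neighbors for the KNN sections
r = 0.1                     # radius for the radius-search section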
1. All Distances
- Only using numpy
The naive method is:
D = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))
However, this takes up a lot of memory, creating an (i, j, n)-shaped intermediate array, and is very slow.
Thanks to a trick from @Divakar (eucl_dist package, wiki), we can instead use a bit of algebra and np.einsum to decompose the calculation: (X - Y)**2 = X**2 - 2*X*Y + Y**2
D = np.sqrt(                                   # (X - Y) ** 2
    np.einsum('ij, ij ->i', X, X)[:, None] +   # = X ** 2
    np.einsum('ij, ij ->i', Y, Y) -            # + Y ** 2
    2 * X.dot(Y.T))                            # - 2 * X * Y
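As a quick sanity check (a sketch; on poorly separated data, tiny negative values under the square root can produce NaNs), this agrees with the naive version:

D_naive = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))
assert np.allclose(D, D_naive)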
Y is X
Similar to above:
XX = np.einsum('ij, ij ->i', X, X)
D = np.sqrt(XX[:, None] + XX - 2 * X.dot(X.T))
Beware that floating-point imprecision can make the diagonal terms deviate very slightly from zero with this method. If you need to make sure they are exactly zero, you'll need to set them explicitly:
np.einsum('ii->i', D)[:] = 0
- Any Package
scipy.spatial.distance.cdist is the most intuitive built-in function for this, and far faster than bare numpy:
from scipy.spatial.distance import cdist
D = cdist(X, Y)
cdist can also deal with many, many distance measures, as well as user-defined distance measures (although these are not optimized). Check the documentation linked above for details.
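For instance (a small sketch; the lambda below is just an illustrative user-defined Chebyshev distance):

D_man = cdist(X, Y, metric = 'cityblock')                        # built-in Manhattan distance
D_cheb = cdist(X, Y, metric = lambda u, v: np.abs(u - v).max())  # user-defined, not optimized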
Y is X
For self-referring distances, scipy.spatial.distance.pdist works similarly to cdist, but returns a 1-D condensed distance array, saving space on the symmetric distance matrix by only storing each term once. You can convert this to a square matrix using squareform:
from scipy.spatial.distance import pdist, squareform
D_cond = pdist(X)
D = squareform(D_cond)
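squareform also works in the other direction, so the round trip recovers the condensed form exactly:

assert np.array_equal(squareform(D), D_cond)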
2. K Nearest Neighbors (KNN)
- Only using numpy
We could use np.argpartition to get the k nearest indices and use those to get the corresponding distance values. So, with D as the array holding the distance values obtained above, we would have:
if k == 1:
    k_i = D.argmin(0)[None]  # keep a length-1 leading axis so take_along_axis works
else:
    k_i = D.argpartition(k, axis = 0)[:k]
k_d = np.take_along_axis(D, k_i, axis = 0)
However, we can speed this up a bit by not taking the square roots until we have reduced our dataset. np.sqrt is the slowest part of calculating the Euclidean norm, so we don't want to do that until the end:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] + \
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(0)[None]  # keep a length-1 leading axis so take_along_axis works
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis = 0))
Now, np.argpartition performs an indirect partition and doesn't necessarily give us the elements in sorted order; it only makes sure that the first k elements are the smallest ones. So, for a sorted output, we need to use argsort on the output from the previous step:
sorted_idx = k_d.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
k_d_sorted = np.take_along_axis(k_d, sorted_idx, axis = 0)
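As a quick sanity check (a sketch, using D from section 1), the partitioned-and-sorted distances should agree with a full sort:

assert np.allclose(k_d_sorted, np.sort(D, axis = 0)[:k])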
If you only need k_i, you never need to take the square root at all:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] + \
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(0)[None]  # keep a length-1 leading axis so take_along_axis works
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d_sq = np.take_along_axis(D_sq, k_i, axis = 0)
sorted_idx = k_d_sq.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
X is Y
In the above code, replace:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] + \
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
with:
XX = np.einsum('ij, ij ->i', X, X)
D_sq = XX[:, None] + XX - 2 * X.dot(X.T)
- Any Package
A KD-Tree is a much faster method for finding neighbors and constrained distances. Be aware that while a KDTree is usually much faster than the brute-force solutions above for 3D (as long as you have more than 8 points), in n dimensions a KDTree only scales well if you have more than 2**n points. For discussion of, and more advanced methods for, high dimensions, see here.
The most recommended method for implementing a KDTree is to use scipy's scipy.spatial.KDTree or scipy.spatial.cKDTree:
from scipy.spatial import KDTree
X_tree = KDTree(X)
k_d, k_i = X_tree.query(Y, k = k)
Unfortunately scipy's KDTree implementation is slow and has a tendency to segfault for larger data sets. As pointed out by @HansMusgrave here, pykdtree increases the performance a lot, but is not as common a dependency as scipy and currently only handles Euclidean distance (while the KDTree in scipy can handle Minkowski p-norms of any order).
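A minimal pykdtree sketch (same query pattern as scipy's; the pyKDTree alias is just to avoid a name clash here):

from pykdtree.kdtree import KDTree as pyKDTree

X_ptree = pyKDTree(X)
k_d, k_i = X_ptree.query(Y, k = k)  # Euclidean distances only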
X is Y
Use instead:
k_d, k_i = X_tree.query(X, k = k)
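Keep in mind that when querying a tree against its own points, each point's nearest neighbor is itself at distance 0, so you may want to request one extra neighbor and drop the first column (a sketch):

k_d, k_i = X_tree.query(X, k = k + 1)
k_d, k_i = k_d[:, 1:], k_i[:, 1:]  # drop each point's self-match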
- Arbitrary metrics
A BallTree has similar algorithmic properties to a KDTree. I'm not aware of a parallel/vectorized/fast BallTree in Python, but using sklearn we can still have reasonable KNN queries for user-defined metrics. If available, built-in metrics will be much faster.
import numpy as np
from sklearn.neighbors import BallTree

def d(a, b):
    return max(np.abs(a - b))  # Chebyshev (L-infinity) distance, a valid metric

tree = BallTree(X, metric = d)
k_d, k_i = tree.query(Y, k = k)
This answer will be wrong if d() is not a true metric. The only reason a BallTree is faster than brute force is that the properties of a metric allow it to rule out some solutions. For truly arbitrary functions, brute force is actually necessary.
3. Radius search
- Only using numpy
The simplest method is just to use boolean indexing:
mask = D_sq < r**2
r_i, r_j = np.where(mask)
r_d = np.sqrt(D_sq[mask])
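The mask itself is often useful too; for example (the name below is just for illustration), counting how many points of Y fall within r of each point of X:

neighbors_per_X = mask.sum(axis = 1)  # Y points within r of each X point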
- Any Package
Similar to above, you can use scipy.spatial.KDTree.query_ball_point
r_ij = X_tree.query_ball_point(Y, r = r)
or scipy.spatial.KDTree.query_ball_tree
Y_tree = KDTree(Y)
r_ij = X_tree.query_ball_tree(Y_tree, r = r)
Unfortunately, r_ij ends up being a list of index arrays that is a bit difficult to untangle for later use; one way to do so is sketched below.
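If you do want flat index arrays from that list-of-lists output (a sketch; note the lists are indexed by query point, i.e. by Y here, and contain indices into X):

y_idx = np.repeat(np.arange(len(r_ij)), [len(ind) for ind in r_ij])
x_idx = np.concatenate([np.asarray(ind, dtype = np.intp) for ind in r_ij])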
Much easier is to use cKDTree's sparse_distance_matrix, which can output a coo_matrix:
from scipy.spatial import cKDTree

X_cTree = cKDTree(X)
Y_cTree = cKDTree(Y)
D_coo = X_cTree.sparse_distance_matrix(Y_cTree, r, output_type = 'coo_matrix')
r_i = D_coo.row
r_j = D_coo.col
r_d = D_coo.data
This is an extraordinarily flexible format for the distance matrix, as it stays an actual matrix and (if converted to csr) can also be used for many vectorized operations.
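For example, after converting to csr, row-wise reductions stay vectorized (rows index X here, columns index Y):

D_csr = D_coo.tocsr()
n_within_r = np.diff(D_csr.indptr)               # stored neighbors of each X point
sum_d = np.asarray(D_csr.sum(axis = 1)).ravel()  # summed distance to those neighbors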