How to do n-D distance and nearest neighbor calculations on numpy arrays

1. All Distances

  • only using numpy

The naive method is:

D = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))

However this takes up a lot of memory creating an (i, j, n)-shaped intermediate matrix, and is very slow

However, thanks to a trick from @Divakar (eucl_dist package, wiki), we can use a bit of algebra and np.einsum to decompose as such: (X - Y)**2 = X**2 - 2*X*Y + Y**2

D = np.sqrt(                                #  (X - Y) ** 2   
np.einsum('ij, ij ->i', X, X)[:, None] +    # = X ** 2        \
np.einsum('ij, ij ->i', Y, Y)          -    # + Y ** 2        \
2 * X.dot(Y.T))                             # - 2 * X * Y
  • Y is X

Similar to above:

XX = np.einsum('ij, ij ->i', X, X)
D = np.sqrt(XX[:, None] + XX - 2 * X.dot(X.T))

Beware that floating-point imprecision can make the diagonal terms deviate very slightly from zero with this method. If you need to make sure they are zero, you'll need to explicitly set it:

np.einsum('ii->i', D)[:] = 0 
  • Any Package

scipy.spatial.distance.cdist is the most intuitive builtin function for this, and far faster than bare numpy

from scipy.spatial.distance import cdist
D = cdist(X, Y)

cdist can also deal with many, many distance measures as well as user-defined distance measures (although these are not optimized). Check the documentation linked above for details.

  • Y is X

For self-referring distances, scipy.spatial.distance.pdist works similar to cdist, but returns a 1-D condensed distance array, saving space on the symmetric distance matrix by only having each term once. You can convert this to a square matrix using squareform

from scipy.spatial.distance import pdist, squareform
D_cond = pdist(X)
D = squareform(D_cond)

2. K Nearest Neighbors (KNN)

  • Only using numpy

We could use np.argpartition to get the k-nearest indices and use those to get the corresponding distance values. So, with D as the array holding the distance values obtained above, we would have -

if k == 1:
    k_i = D.argmin(0)
else:
    k_i = D.argpartition(k, axis = 0)[:k]
k_d = np.take_along_axis(D, k_i, axis = 0)

However we can speed this up a bit by not taking the square roots until we have reduced our dataset. np.sqrt is the slowest part of calculating the Euclidean norm, so we don't want to do that until the end.

D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(0)
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis = 0))

Now, np.argpartition performs indirect partition and doesn't necessarily give us the elements in sorted order and only makes sure that the first k elements are the smallest ones. So, for a sorted output, we need to use argsort on the output from previous step -

sorted_idx = k_d.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
k_d_sorted = np.take_along_axis(k_d, sorted_idx, axis = 0)

If you only need, k_i, you never need the square root at all:

D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(0)
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d_sq = np.take_along_axis(D_sq, k_i, axis = 0)
sorted_idx = k_d_sq.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
  • X is Y

In the above code, replace:

D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)

with:

XX = np.einsum('ij, ij ->i', X, X)
D_sq = XX[:, None] + XX - 2 * X.dot(X.T))
  • Any Package

KD-Tree is a much faster method to find neighbors and constrained distances. Be aware the while KDTree is usually much faster than brute force solutions above for 3d (as long as oyu have more than 8 points), if you have n-dimensions, KDTree only scales well if you have more than 2**n points. For discussion and more advanced methods for high dimensions, see Here

The most recommended method for implementing KDTree is to use scipy's scipy.spatial.KDTree or scipy.spatial.cKDTree

from scipy.spatial.distance import KDTree
X_tree = KDTree(X)
k_d, k_i = X_tree.query(Y, k = k)

Unfortunately scipy's KDTree implementation is slow and has a tendency to segfault for larger data sets. As pointed out by @HansMusgrave here, pykdtree increases the performance a lot, but is not as common an include as scipy and can only deal with Euclidean distance currently (while the KDTree in scipy can handle Minkowsi p-norms of any order)

  • X is Y

Use instead:

k_d, k_i = X_tree.query(X, k = k)
  • Arbitrary metrics

A BallTree has similar algorithmic properties to a KDTree. I'm not aware of a parallel/vectorized/fast BallTree in Python, but using scipy we can still have reasonable KNN queries for user-defined metrics. If available, builtin metrics will be much faster.

def d(a, b):
    return max(np.abs(a-b))

tree = sklearn.neighbors.BallTree(X, metric=d)
k_d, k_i = tree.query(Y)

This answer will be wrong if d() is not a metric. The only reason a BallTree is faster than brute force is because the properties of a metric allow it to rule out some solutions. For truly arbitrary functions, brute force is actually necessary.

3. Radius search

  • Only using numpy

The simplest method is just to use boolean indexing:

mask = D_sq < r**2
r_i, r_j = np.where(mask)
r_d = np.sqrt(D_sq[mask])
  • Any Package

Similar to above, you can use scipy.spatial.KDTree.query_ball_point

r_ij = X_tree.query_ball_point(Y, r = r)

or scipy.spatial.KDTree.query_ball_tree

Y_tree = KDTree(Y)
r_ij = X_tree.query_ball_tree(Y_tree, r = r)

Unfortunately r_ij ends up being a list of index arrays that are a bit difficult to untangle for later use.

Much easier is to use cKDTree's sparse_distance_matrix, which can output a coo_matrix

from scipy.spatial.distance import cKDTree
X_cTree = cKDTree(X)
Y_cTree = cKDTree(Y)
D_coo = X_cTree.sparse_distance_matrix(Y_cTree, r = r, output_type = `coo_matrix`)
r_i = D_coo.row
r_j = D_coo.column
r_d = D_coo.data

This is an extraordinarily flexible format for the distance matrix, as it stays an actual matrix (if converted to csr) can also be used for many vectorized operations.