Improve computational speed for chunk-wise distance calculation
Distance calculation is a common problem, so it is a good idea to use an existing, optimized implementation, specifically sklearn's pairwise_distances. The data you provided is not in a convenient shape to work with directly, but the example below may give you ideas on how to adapt this workflow to the specifics of your data:
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances
X = pd.DataFrame(np.random.rand(10, 30))
Y = pd.DataFrame(np.random.rand(20, 30))
def custom_distance(x, y):
    """Sample asymmetric function."""
    return max(x) + min(y)
# use n_jobs=-1 to run calculations with all cores
result = pairwise_distances(X, Y, metric=custom_distance, n_jobs=-1)
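As a sanity check on the result, `pairwise_distances` returns a matrix with one row per sample in `X` and one column per sample in `Y`. A minimal self-contained sketch, using the same synthetic data shapes as above:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.random((10, 30))
Y = rng.random((20, 30))

def custom_distance(x, y):
    """Sample asymmetric function: max of one vector plus min of the other."""
    return max(x) + min(y)

result = pairwise_distances(X, Y, metric=custom_distance, n_jobs=-1)

# One row per sample in X, one column per sample in Y
print(result.shape)  # (10, 20)
```

Note that with a Python callable as the metric, sklearn cannot use its fast compiled paths, so `n_jobs=-1` is the main lever for speed here.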
To complete @SultanOrazbayev's answer:
import numpy as np
from sklearn.metrics import pairwise_distances

# Convert each list-valued cell into a 1-D numpy array
Ax = df_sample['CalVec'].apply(np.array)
Bx = DF['MeasVec'].apply(np.array)

# Stack the object-dtype Series into regular 2-D numpy arrays
AA = np.stack(Ax.to_numpy())
BB = np.stack(Bx.to_numpy())

result = pairwise_distances(AA, BB, metric=custom_distance, n_jobs=-1)
which completes in under 3 minutes.
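Since `df_sample` and `DF` are not shown in the question, here is a self-contained sketch of the conversion step with synthetic stand-in data (the column name `CalVec` is taken from the answer above; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for a DataFrame whose cells hold Python lists
df_sample = pd.DataFrame({'CalVec': [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})

# Each cell becomes a 1-D numpy array; the Series still has object dtype
Ax = df_sample['CalVec'].apply(np.array)

# np.stack turns the object-dtype array of vectors into a regular 2-D array,
# which is the layout pairwise_distances expects
AA = np.stack(Ax.to_numpy())

print(AA.shape)  # (2, 3)
print(AA.dtype)  # float64
```

The `np.stack` step is what makes this fast: `pairwise_distances` works on a contiguous 2-D numeric array, not on a column of Python lists.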