Improve computational speed for chunk-wise distance calculation
Distance calculation is a common problem, so it is a good idea to use an existing, optimized implementation, specifically sklearn's pairwise_distances. The data you provided is not in a convenient shape to work with directly, but the example below may give you ideas on how to adapt this workflow to the specifics of your data:
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances
X = pd.DataFrame(np.random.rand(10, 30))
Y = pd.DataFrame(np.random.rand(20, 30))
def custom_distance(x, y):
    """Sample asymmetric function."""
    return max(x) + min(y)
# use n_jobs=-1 to run calculations with all cores
result = pairwise_distances(X, Y, metric=custom_distance, n_jobs=-1)
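As a sanity check on the result, `pairwise_distances` returns a matrix with one row per sample in `X` and one column per sample in `Y`. A minimal self-contained sketch, using the same synthetic data shapes as above:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.random((10, 30))
Y = rng.random((20, 30))

def custom_distance(x, y):
    """Sample asymmetric function: max of one vector plus min of the other."""
    return max(x) + min(y)

result = pairwise_distances(X, Y, metric=custom_distance, n_jobs=-1)

# One row per sample in X, one column per sample in Y
print(result.shape)  # (10, 20)
```

Note that with a Python callable as the metric, sklearn cannot use its fast compiled paths, so `n_jobs=-1` is the main lever for speed here.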
To complete @SultanOrazbayev's answer:
import numpy as np
from sklearn.metrics import pairwise_distances

# Convert each list-valued cell into a 1-D numpy array
Ax = df_sample['CalVec'].apply(np.array)
Bx = DF['MeasVec'].apply(np.array)

# Stack the object-dtype Series into regular 2-D numpy arrays
AA = np.stack(Ax.to_numpy())
BB = np.stack(Bx.to_numpy())

result = pairwise_distances(AA, BB, metric=custom_distance, n_jobs=-1)
which completes in under 3 minutes.
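Since `df_sample` and `DF` are not shown in the question, here is a self-contained sketch of the conversion step with synthetic stand-in data (the column name `CalVec` is taken from the answer above; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for a DataFrame whose cells hold Python lists
df_sample = pd.DataFrame({'CalVec': [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})

# Each cell becomes a 1-D numpy array; the Series still has object dtype
Ax = df_sample['CalVec'].apply(np.array)

# np.stack turns the object-dtype array of vectors into a regular 2-D array,
# which is the layout pairwise_distances expects
AA = np.stack(Ax.to_numpy())

print(AA.shape)  # (2, 3)
print(AA.dtype)  # float64
```

The `np.stack` step is what makes this fast: `pairwise_distances` works on a contiguous 2-D numeric array, not on a column of Python lists.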