Apache Spark Python Cosine Similarity over DataFrames
You can use the built-in columnSimilarities()
method on a RowMatrix
, that can both calculate the exact cosine similarities, or estimate it using the DIMSUM method, which will be considerably faster for larger datasets. The difference in usage is that for the latter, you'll have to specify a threshold
.
Here's a small reproducible example:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
MatrixEntry(1, 2, 0.998441152599),
MatrixEntry(0, 1, 0.997463284056)]