Euclidean distance vs Pearson correlation vs cosine similarity?
Solution 1:
Pearson correlation and cosine similarity are invariant to scaling, i.e. multiplying all elements by a nonzero constant. Pearson correlation is also invariant to adding any constant to all elements. For example, if you have two vectors X1 and X2, and your Pearson correlation function is called pearson(), then pearson(X1, X2) == pearson(X1, 2 * X2 + 3). This is a pretty important property, because you often don't care that two vectors are similar in absolute terms, only that they vary in the same way.
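A minimal sketch of those invariances with numpy; pearson() follows the naming used above, and cosine() is just an illustrative helper, not a library function:

import numpy as np

def pearson(a, b):
    # Pearson correlation, taken from the off-diagonal of the correlation matrix
    return np.corrcoef(a, b)[0, 1]

def cosine(a, b):
    # plain cosine similarity: dot product divided by the product of the norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

X1 = np.array([1.0, 2.0, 3.0, 4.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0])

print(np.isclose(pearson(X1, X2), pearson(X1, 2 * X2 + 3)))  # True: invariant to scale and shift
print(np.isclose(cosine(X1, X2), cosine(X1, 2 * X2)))        # True: invariant to scale
print(np.isclose(cosine(X1, X2), cosine(X1, X2 + 3)))        # False: not invariant to shift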
Solution 2:
The difference between Pearson Correlation Coefficient and Cosine Similarity can be seen from their formulas:
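For reference, the standard definitions are (in LaTeX notation):

\mathrm{Pearson}(X, Y) = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}}

\mathrm{CosSim}(X, Y) = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}}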
The reason the Pearson Correlation Coefficient is invariant to adding any constant is that the means are subtracted out by construction. It is also easy to see that the Pearson Correlation Coefficient and Cosine Similarity are equivalent when X and Y have means of 0, so we can think of the Pearson Correlation Coefficient as a demeaned version of Cosine Similarity.
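A quick numerical check of that equivalence (the two vectors here are just made-up example values):

import numpy as np

a = np.array([0.1, 0.2, 0.1, -0.1, 0.5])
b = np.array([0.3, 0.1, 0.4, 0.0, 0.2])

a0, b0 = a - a.mean(), b - b.mean()  # demean both vectors

# cosine similarity of the demeaned vectors
cos_demeaned = np.dot(a0, b0) / (np.linalg.norm(a0) * np.linalg.norm(b0))

print(np.corrcoef(a, b)[0, 1])  # Pearson correlation of the raw vectors
print(cos_demeaned)             # same value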
For practical usage, let's consider the returns of two assets x and y:
In [275]: import numpy as np
In [276]: x = np.array([0.1, 0.2, 0.1, -0.1, 0.5])
In [277]: y = x + 0.1
These assets' returns have exactly the same variability, as measured by the Pearson Correlation Coefficient (1), but they are not exactly similar, as measured by the cosine similarity (≈ 0.971).
In [281]: np.corrcoef([x, y])
Out[281]:
array([[ 1.,  1.],   # the off-diagonal entries are the correlations
       [ 1.,  1.]])  # between x and y
In [282]: from sklearn.metrics.pairwise import cosine_similarity
In [283]: cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))  # sklearn expects 2D (n_samples, n_features) arrays
Out[283]: array([[ 0.97128586]])
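The correlation comes out as exactly 1 because y is just a shifted copy of x, and demeaning removes the shift; the raw vectors still differ, which is why the cosine similarity stays below 1. A small standalone check:

import numpy as np

x = np.array([0.1, 0.2, 0.1, -0.1, 0.5])
y = x + 0.1

# demeaning removes the constant shift, so the centered vectors coincide
print(np.allclose(x - x.mean(), y - y.mean()))  # True -> Pearson correlation is exactly 1

# the raw vectors still differ, so the cosine similarity stays below 1
print(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))  # ~0.9713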
Solution 3:
In addition to @dsimcha's answer, the cosine similarities of a subset of the original data are the same as those of the original data, which is not true for the Pearson correlation. This can be useful when clustering subsets of your data: they are (topologically) identical to the original clustering, so they can be more easily visualized and interpreted.