Cosine similarity vs angular distance
While reading Google's Universal Sentence Encoder paper, I found that they mention that a similarity based on angular distance performs better than raw cosine similarity.
More specifically, they used the following formula for the similarity:
sim(u, v) = 1 - arccos(cosine_similarity(u, v)) / pi
with:
cosine_similarity(u, v) = (u . v) / (||u|| . ||v||)
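In Python, the two formulas above can be sketched as follows (the function names are mine, not from the paper):

```python
import math

def cosine_similarity(u, v):
    """cosine_similarity(u, v) = (u . v) / (||u|| * ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_similarity(u, v):
    """sim(u, v) = 1 - arccos(cosine_similarity(u, v)) / pi"""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    x = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return 1.0 - math.acos(x) / math.pi

u, v = [1.0, 0.0], [1.0, 1.0]    # 45 degrees apart
print(cosine_similarity(u, v))   # ~0.7071
print(angular_similarity(u, v))  # 1 - 45/180 = 0.75
```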
Can anyone provide a mathematical intuition as to why the first similarity metric performs better than the second?
Solution 1:
As Somos already answered, the problem with the cosine similarity is that small angles have very similar cosines. (In the diagram further below, compare the curves near the top edge.)
Let the two vectors be $\mathbf{u}$ and $\mathbf{v}$. Then, $$\mathbf{u} \cdot \mathbf{v} = \left\lVert \mathbf{u} \right\rVert \left\lVert \mathbf{v} \right\rVert \cos(\varphi) \quad \iff \quad x = \cos(\varphi) = \frac{\mathbf{u} \cdot \mathbf{v}}{\left\lVert \mathbf{u} \right\rVert \left\lVert \mathbf{v} \right\rVert} \tag{1}\label{NA1}$$ i.e., $\varphi$ is the angle between the two vectors, and $x$ is the cosine of that angle.
In the Google paper OP linked to, a function $f(x) = f\left(\cos(\varphi)\right)$ is used to map the (cosine of the) angle between the two vectors to "similarity".
Note that this function $f$ is what defines the meaning of "similarity" in that paper. Any function $f(x)$ that fulfills $f(-1)=0$, $f(1)=1$, and is nondecreasing for $-1 \le x \le 1$ can be used, because then "similarity" is closely related to both the dot product and the angle between the two vectors, only "weighted" to emphasize some values more than others.
(The nondecreasing part boils down to the fact that the methods used in the paper are designed for vectors $\mathbf{u}$, $\mathbf{v}$ where "similarity" increases as $x = \mathbf{u} \cdot \mathbf{v} / (\lVert\mathbf{u}\rVert \lVert\mathbf{v}\rVert)$ increases. We do not want mappings that break that, only emphasize or de-emphasize some ranges of values, so that the definition of "similarity" is useful in practice.)
As OP stated, in the paper the mapping is $$f(x) = 1 - \frac{\arccos(x)}{\pi} \quad \iff \quad f_\varphi(\varphi) = f\left(\cos(\varphi)\right) = 1 - \frac{\varphi}{180°}$$ Above, the left plot shows various "similarity" mappings $f(x)$ as a function of $x = \cos(\varphi)$, and the right plot shows the corresponding curves $f_\varphi(\varphi) = f(\cos(\varphi))$ as a function of the angle $\varphi$ (in degrees) between the vectors.
The similarity mapping function formulae used in the above plots are $$\begin{array}{l|l|l} \; & f(x) & f_\varphi(\varphi) \\ \hline \text{Cosine} & f(x) = (x+1)/2 & f_\varphi(\varphi) = (\cos(\varphi) + 1)/2 \\ \text{Angular} & f(x) = 1 - \arccos(x)/\pi & f_\varphi(\varphi) = 1 - \frac{\varphi}{180°} \\ \text{cubic} & f(x) = \frac{1}{2} + \frac{1}{4}x + \frac{1}{4}x^3 & f_\varphi(\varphi) = \frac{1}{2} + \frac{1}{4}\cos(\varphi) + \frac{1}{4}\left(\cos(\varphi)\right)^3 \\ \text{square root} & f(x) = 1 - \sqrt{\frac{1-x}{2}} & f_\varphi(\varphi) = 1 - \sqrt{\frac{1 - \cos(\varphi)}{2}} \\ \end{array}$$
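The four mappings from the table can be written out and checked against the endpoint conditions $f(-1)=0$ and $f(1)=1$; a small sketch:

```python
import math

mappings = {
    "cosine":      lambda x: (x + 1) / 2,
    "angular":     lambda x: 1 - math.acos(x) / math.pi,
    "cubic":       lambda x: 0.5 + 0.25 * x + 0.25 * x ** 3,
    "square root": lambda x: 1 - math.sqrt((1 - x) / 2),
}

for name, f in mappings.items():
    # Every mapping must satisfy f(-1) = 0 and f(1) = 1.
    assert abs(f(-1.0)) < 1e-12 and abs(f(1.0) - 1.0) < 1e-12
    # f(0) is the "similarity" assigned to perpendicular vectors.
    print(f"{name:12s} f(0) = {f(0.0):.4f}")
```

Only the square-root mapping assigns perpendicular vectors a value other than $0.5$ (about $0.293$).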
Let's examine how these mappings behave compared to one another.
Cosine similarity ($f(x)=(x + 1)/2$) makes nearly parallel vectors very "similar". Note that this is essentially $f(x) = x$, except scaled and translated to give $f(-1)=0$ and $f(1)=1$. If you look at the blue curve in the right side plot, you can see that small angles $\varphi$ all produce almost the same "similarity".
Angular similarity ($f(x)=1-\arccos(x)/\pi$) distinguishes nearly parallel vectors much better: small changes in $x$ near $x = 1$ yield clearly different "similarity" values, as seen in the left side plot.
The cubic mapping function ($f(x) = (2 + x + x^3)/4$) is roughly between the two.
I included this one to show that the mapping function could just as well be a polynomial, especially because trigonometric functions tend to be slow compared to multiplication and addition.
The square root mapping function ($f(x) = 1 - \sqrt{(1-x)/2}$) emphasizes small differences even more than angular similarity for all high "similarity" values, with perpendicular vectors only having about $0.293$ "similarity" (compared to $0.5$ for all other mapping functions shown). It is also non-symmetric, because there really is no pressing reason for it to be. (If the similarity mapping function is symmetric, then negating one of the vectors also negates the "similarity". Here, negating a vector, a sentence embedding, really makes no sense.)
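A quick numerical check of the claims above (a sketch; the helper names are mine):

```python
import math

def cosine_map(x):
    return (x + 1) / 2

def angular_map(x):
    return 1 - math.acos(x) / math.pi

def sqrt_map(x):
    return 1 - math.sqrt((1 - x) / 2)

# Nearly parallel vectors: the cosine mapping bunches them near 1,
# angular spreads them out linearly, square root spreads them out the most.
for deg in (1, 5, 10):
    x = math.cos(math.radians(deg))
    print(f"{deg:2d} deg: cosine={cosine_map(x):.5f}  "
          f"angular={angular_map(x):.5f}  sqrt={sqrt_map(x):.5f}")

# Perpendicular vectors: only the square-root mapping deviates from 0.5.
print(f"90 deg: sqrt={sqrt_map(0.0):.3f}")  # ~0.293
```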
Picking $f(x) = 1 - \arccos(x)/\pi$ as the "similarity" mapping function was an easy, obvious choice for the authors of the referenced paper: it is easy to describe (35 or so words in the paper, including the footnote!), very intuitive, and, according to the paper, sufficiently emphasizes non-"similarity" for vector pairs with a small angle, compared to the raw cosines of their angles.
That said, I do wonder if $f(x) = 1 - \sqrt{(1 - x)/2}$ would perform even better, or if it would emphasize differences between very similar sentences too much.
Solution 2:
The problem with the cosine is that when the angle between two vectors is small, the cosine of the angle is very close to $1$ and you lose precision. An identity for this is $\ 1 - \cos(x) = 2 \sin^2(x/2). \ $ If you try this with fixed precision numbers, the left side loses precision but the right side does not. Depending on the context, you want the measure to depend linearly on the angle, at least for small angles. That is one reason why the raw cosine is not a good measure.
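The precision loss is easy to demonstrate with double-precision floats; a sketch:

```python
import math

# 1 - cos(x) suffers catastrophic cancellation for small x,
# while the identity 1 - cos(x) = 2*sin(x/2)**2 stays accurate.
for x in (1e-4, 1e-6, 1e-8):
    naive = 1.0 - math.cos(x)
    stable = 2.0 * math.sin(x / 2.0) ** 2
    print(f"x={x:g}: naive={naive:.12e}  stable={stable:.12e}")

# At x = 1e-8, cos(x) rounds to exactly 1.0 in double precision, so the
# naive form returns 0.0, while the stable form still returns roughly
# x**2 / 2 = 5e-17.
```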
Note that because of its importance, the function $\,1-\cos(x)\,$ has a name, the versine, and even more useful is the haversine function. The Wikipedia article states:
The haversine, in particular, was important in navigation because it appears in the haversine formula.