vector subset problem for linear approximation
Yo, big fan of your Yuno project.
I've read your Medium post, where the parts about the node and content indexes are the most important here I believe.
I will try to phrase the problem description in my own words to make sure I understand the problem correctly (Please correct me if I'm wrong or thinking in the wrong direction anywhere in the explanations that follow):
We have a set of vectors $V = \{v_{1}, ..., v_{k}\}$ in $\mathbb{R}^{n}$ and a target $t \in \mathbb{R}^{n}$ and an $m>1$ in $\mathbb{N}$. I will also assume that $v_{i}$ and $t$ are all unit vectors with respect to L2 norm (I'm unsure if this matters).
We try to find a subset $C \subset V$ with $|C|=m$ that maximises: $$o = \frac{1}{m}\sum_{i=1}^{m}y_{i}$$ with $y_{i}$ the i-th entry of vector $y$. Where $$y = argmin_{x}||Dx-t||$$ and where in this expression we denote by $D$ the $n\times m$ matrix induced by the vectors in $C$. (ordering doesn't matter, the sum $o$ of the argmin vector entries stays the same.)
To start of bluntly, I have no idea how to solve this problem efficiently. But I believe we can formulate another problem, that we can solve efficiently and that may even be better suited for your usecase.
First I will argue by an example why I don't think solving the previous mathematical problem is optimal for what Yuno is trying to do.
Let's choose very simple vectors and low dimension for the example:
$V = \{v_{1}, v_{2}, v_{3}\}$ with $v_{1} = (0, 1),\; v_{2} = (\sqrt{2}/2, \sqrt{2}/2),\; v_{3} = (1, 0)$ and let $t = v_{2}$ and $m = 2$.
In Yuno, this would mean that the query vector $t$ is exactly equal to the review vector $v_{2}$.
but now if we look at the possible subsets of V we get: $$C = \{v_{1}, v_{2}\} \rightarrow o = 1/2$$ $$C = \{v_{1}, v_{3}\} \rightarrow o = \sqrt{2}/2$$ $$C = \{v_{2}, v_{3}\} \rightarrow o = 1/2$$ So the best solution of this scheme does not include the review that is exactly like the query. I would argue that this is not what we want.
I would argue that we want an indexer that strikes a good balance between the node indexer (reviews that most closely resemble the query) and the content indexer (reviews that describe the query with the most diversity).
So how I would do this is build the subset $C$ step by step.
so we make $C_{1} \subset C_{2} \subset ... \subset C_{m} = C$ where $C_{i}$ contains $i$ vectors from $V$.
So first we maximize $o$ for $m=1$ and get $C_{1}$ (this will be the vector $v_{i}$ with maximal cosine similarity to $t$)
Then we maximize $o$ for $m=2$ where we use the vector from $C_{1}$ and add 1 new vector from $V$, so we just have to look at $k-1$ possibilities.
then we maximize $o$ for $m=3$ where we use the vectors from $C_{2}$ and add 1 new vector from $V$, so we look at $k-2$ possibilities. and we keep doing this until we reach $C_{m}$
So now we only had to look at less than $mk$ possibilities in total. A lot less than the ${k \choose m}$ possibilities in the previous algorithm.
Now I will argue with a small example why I think this way of adding vectors to C makes sense:
suppose we have $V = \{v_{1}, v_{2}, v_{3}\}$ with $v_{1} = (1, 0),\; v_{2} = (0.9, 0.44),\; v_{3} = (0, 1)$ and $t = (0.87, 0.5)$ and $m = 2$
So we have that the cosine similarity of $t$ with respect to $v_{2} > v_{1} > v_{3}$. So the node indexer would choose $v_{1}$ and $v_{2}$ as best reviews for the query. But since $v_{1}$ and $v_{2}$ lie at the same side of $t$, we have that $v_{1}$ does not give us new information about query $t$, if we already have the information that $v_{2}$ provides.
We want an algorithm that chooses review $v_{3}$ instead of $v_{1}$, since it contains the most information on top of $v_{2}$. and this is exactly what the previously outlined algorithm does:
It first constructs $C_{1} = \{v_{2}\}$.
then we look at possibilities for $C_{2}$, these are $\{v_{1}, v_{2}\}$ or $\{v_{2}, v_{3}\}$. We find: $$C = \{v_{1}, v_{2}\} \rightarrow o = 0.48$$ $$C = \{v_{2}, v_{3}\} \rightarrow o = 0.52$$ So our algorithm would choose the most new information review $v_{3}$ over the less new information review $v_{1}$, even though it has worse cosine similarity to $t$.
Ok, this ended up being longer than I anticipated. I hope it makes sense. And if it does, I'd love to hear if it ends up helping Yuno to give even better query results! If it doesn't make sense or I'm thinking in the wrong direction for your application, maybe we could talk on discord to work it out further.
Thanks again for building Yuno. I eagerly await your response.