Intuition for why the average minimizes the sum of squared Euclidean distances

I saw a question with a very similar (if not identical) flavour to mine, but its answer used derivatives, the method I already knew solves this problem.

I feel it is "obvious" which value minimizes the sum of squared Euclidean distances to a set of points, i.e. which $z$ minimizes:

$$\sum^{k}_{i=1}\|x_i - z\|^2$$

I know the solution can be obtained with derivatives and that $z = \frac{\sum^k_{i=1} x_i}{k}$, but even before I tried solving it that way, it seemed "obvious" that this was the answer. Solving it with derivatives is certainly a correct approach, but it felt like overkill for such a simple problem.
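As a quick sanity check of that claim, here is a small numerical sketch (NumPy, arbitrary data; an illustration rather than an argument): the objective evaluated at the average is never beaten by nearby candidates.

```python
# Numerical sketch: the objective sum_i ||x_i - z||^2 evaluated at the average
# of the x_i is never larger than at randomly perturbed candidates.
# (Arbitrary data; an illustration of the claim, not a proof.)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 3))        # k = 7 points in R^3

def objective(z):
    """Sum of squared Euclidean distances from the points x_i to z."""
    return np.sum(np.linalg.norm(x - z, axis=1) ** 2)

z_bar = x.mean(axis=0)             # the claimed minimizer: the average

for _ in range(1000):
    z = z_bar + rng.normal(scale=0.5, size=3)
    assert objective(z_bar) <= objective(z)
print("objective at the average:", objective(z_bar))
```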

I was wondering if anyone had an intuitive argument for this solution. It just seems so obvious, and there is a way of doing it rigorously, but I am more interested in whether someone knows intuitively why that has to be the solution. Maybe there isn't one, but I am curious whether someone has an alternative view of the problem/solution.

Thanks in advance!


Solution 1:

Here are two ways of viewing it. The second may be (for some people?) more "intuitive":

First way: for any number $m$, write $x_i - z = (x_i - m) + (m - z)$ and expand the square: $$ \sum_{i=1}^k (x_i - z)^2 = \sum_{i=1}^k \Big((x_i - m)^2 + 2(x_i-m)(m-z) + (m-z)^2\Big). $$

In the sum of the middle term, $\displaystyle\sum_{i=1}^k 2(x_i-m)(m-z)$, the factor $2(m-z)$ does not depend on the index $i$, i.e. does not change as $i$ goes from $1$ to $k$, hence this sum is $\displaystyle 2(m-z)\sum_{i=1}^k (x_i-m)$.

That sum is $0$ if and only if $m=\bar x = (x_1+\cdots+x_k)/k$, so take $m = \bar x$ from now on.

In the last term, $\displaystyle\sum_{i=1}^k (m-z)^2$, the whole expression $(m-z)^2$ does not change as $i$ goes from $1$ to $k$, so it's a sum of $k$ terms that are all equal; hence it is $k(m-z)^2$.

Therefore $$ \sum_{i=1}^k (x_i-z)^2 = k(\bar x - z)^2 + \sum_{i=1}^k (x_i-\bar x)^2. $$ Since $z$ appears only in the first term of this last expression, the value of $z$ that minimizes the whole expression is the value that minimizes the first term, namely $z = \bar x$.
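For readers who like to check algebra numerically, here is a small sketch (NumPy, scalar data as in the sums above, arbitrary values) confirming the identity for random inputs:

```python
# Numerical check of the identity
#   sum_i (x_i - z)^2 = k*(xbar - z)^2 + sum_i (x_i - xbar)^2
# for scalar x_i and an arbitrary candidate z.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)            # k = 10 scalar observations
z = rng.normal()                   # an arbitrary candidate value
xbar = x.mean()

lhs = np.sum((x - z) ** 2)
rhs = len(x) * (xbar - z) ** 2 + np.sum((x - xbar) ** 2)
print(lhs, rhs)                    # agree up to rounding error
assert np.isclose(lhs, rhs)
```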

That's one way to show that the least-squares estimate of the population mean is the sample mean.

Second way:

But now let's look at it geometrically: $$ \begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} = \begin{bmatrix} \bar x \\ \vdots \\ \bar x \end{bmatrix} + \begin{bmatrix} x_1 - \bar x \\ \vdots \\ x_k - \bar x \end{bmatrix}. $$ The first term on the right is the orthogonal projection of the vector on the left onto the one-dimensional subspace of $\mathbb R^k$ spanned by $(1,\dotsc,1)$. The second term on the right is the orthogonal projection of the same vector onto the complementary $(k-1)$-dimensional subspace. The vector in a subspace that is nearest, in Euclidean distance, to a vector not in the subspace is the orthogonal projection onto the subspace; so the constant vector $(z,\dotsc,z)$ nearest to $(x_1,\dotsc,x_k)$ is the one with $z=\bar x$.
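As a concrete illustration of that projection picture (a numerical sketch with NumPy; the data are arbitrary): projecting $(x_1,\dotsc,x_k)$ onto the span of the all-ones vector gives the constant vector whose entries are $\bar x$, and the residual is orthogonal to it.

```python
# Sketch: the orthogonal projection of (x_1, ..., x_k) onto the span of the
# all-ones vector is the constant vector (xbar, ..., xbar); the residual
# (x_i - xbar) is orthogonal to that span.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=6)                     # a vector in R^k, k = 6
ones = np.ones_like(x)

proj = (x @ ones) / (ones @ ones) * ones   # <x, u> / <u, u> * u with u = ones
residual = x - proj

print(proj)                                # every entry equals x.mean()
print(np.dot(residual, ones))              # ~ 0: residual is orthogonal to ones
assert np.allclose(proj, x.mean())
```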

Solution 2:

The following sketch captures one possible intuition for the situation.

  1. We may assume $z$ lies in the affine hull of the $x_i$. (Indeed, replacing $z$ with its image under orthogonal projection onto that affine subspace reduces each of the distances $\|x_i-z\|$.) That is, we assume $z = \sum_i \lambda_i x_i$, where $\sum_i \lambda_i = 1$.
  2. Each function $z\mapsto\|x_i-z\|^2$ is strictly convex (indeed, its graph is a paraboloid), so their sum $z\mapsto\sum_i\|x_i-z\|^2$ is also strictly convex.
  3. The objective function is symmetric under permutation of the $x_i$: the candidate solutions $z=\sum_i\lambda_i x_i$ and $z^\sigma = \sum_i \lambda_{\sigma(i)} x_i$, where $\sigma$ is a permutation of $\{1,\dotsc,k\}$, have the same value of the objective function. If $z\ne z^\sigma$, then the point $\frac12(z+z^\sigma)$ is a better solution than both $z$ and $z^\sigma$, since the objective function is strictly convex. So the optimal solution must be invariant under such permutations, that is, its $\lambda_i$ are all equal. Since they also sum to $1$, each $\lambda_i = 1/k$, and the minimizer is $z = \frac1k\sum_i x_i$: the average.
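To see the conclusion numerically, here is a small sketch (NumPy, arbitrary data): among combinations $z=\sum_i\lambda_i x_i$ with $\sum_i\lambda_i=1$, the equal weights $\lambda_i = 1/k$, i.e. the average, do at least as well as any other sampled choice of weights.

```python
# Numerical sketch of the conclusion: among combinations z = sum_i lambda_i x_i
# with sum_i lambda_i = 1, the equal weights lambda_i = 1/k (the average of the
# x_i) give the smallest objective value among the sampled candidates.
import numpy as np

rng = np.random.default_rng(3)
k = 5
x = rng.normal(size=(k, 2))                # k points in the plane

def objective(z):
    """Sum of squared Euclidean distances from the points x_i to z."""
    return np.sum(np.linalg.norm(x - z, axis=1) ** 2)

z_equal = np.full(k, 1.0 / k) @ x          # equal weights: the average

for _ in range(1000):
    lam = rng.dirichlet(np.ones(k))        # random weights summing to 1
    assert objective(z_equal) <= objective(lam @ x)
print("objective with equal weights:", objective(z_equal))
```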