If there is a linear function $g$ which is at least as good an approximation as any other linear $h$, then $f$ is differentiable at $x_0$.
This question is related to another question, where the author asks about the intuition behind saying that the derivative is the best linear approximation. One of the answers, by user "Milo Brandt", gives two theorems, one of which is: $f$ is differentiable at $x_0$ if and only if there is a linear function $g$ which is at least as good an approximation as any other linear $h$.
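To make the "at least as good" property concrete, here is a quick numerical sketch in one dimension (my own illustration; the specific $f$, tangent line, and competitor are hypothetical choices, not from the linked question):

```python
import numpy as np

# Illustration: for f(x) = x^2 at x0 = 1, the tangent line
# g(x) = 1 + 2(x - 1) is "at least as good" near x0 as an arbitrary
# competing affine line h(x) = C + D(x - 1) with (C, D) != (1, 2).
f = lambda x: x**2
x0 = 1.0
g = lambda x: 1.0 + 2.0 * (x - x0)      # derivative-based affine approximation
h = lambda x: 1.1 + 1.5 * (x - x0)      # a competing affine line (arbitrary choice)

x = x0 + np.linspace(-1e-3, 1e-3, 1001) # a small neighborhood of x0
# |f - g| = (x - x0)^2 <= 1e-6 here, while |f - h| is about 0.1.
assert np.all(np.abs(f(x) - g(x)) <= np.abs(f(x) - h(x)))
```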
I am struggling to prove one part of this theorem. First, I think that $g$ and $h$ are supposed to be affine rather than linear, in the sense that $g(x) = A + B(x-x_0)$ and $h(x) = C + D(x-x_0)$, where $A,C \in \mathbb{R}^m$ and $B,D : \mathbb{R}^n \to \mathbb{R}^m$ are linear functions.
Assume that there is such a function $g$ which is at least as good an approximation as any other $h$. By definition, this means that for every affine $h$ there exists $\delta > 0$ such that for all $x$ with $|x - x_0 | < \delta$ we have $|f(x) - g(x)| \leq |h(x) - f(x)|$.
I would like to show that $f$ is differentiable at $x_0$, in other words, that there exists a linear function $\lambda : \mathbb{R}^n \to \mathbb{R}^m$ such that:
$$ \lim \limits_{x \to x_0} \frac{|f(x) - f(x_0) - \lambda(x-x_0)|}{|x-x_0|} = 0 $$
This translates to being able to find a $\lambda$ such that for each $\varepsilon > 0$ we can find $\delta > 0$ such that whenever $|x - x_0| < \delta$, we have $|f(x) - f(x_0) - \lambda(x-x_0)| < \varepsilon |x-x_0|$.
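As a sanity check on this $\varepsilon$-$\delta$ formulation, here is a numerical sketch (my own illustration, for a concrete hypothetical map $f:\mathbb R^2\to\mathbb R^2$) showing the ratio shrinking when $\lambda$ is the Jacobian:

```python
import numpy as np

# Concrete check of the limit definition for f(x, y) = (x^2 + y, x*y)
# at x0 = (1, 3); the candidate lambda is the Jacobian matrix at x0.
f = lambda p: np.array([p[0]**2 + p[1], p[0] * p[1]])
x0 = np.array([1.0, 3.0])
J = np.array([[2.0, 1.0],   # gradient of x^2 + y at (1, 3)
              [3.0, 1.0]])  # gradient of x*y  at (1, 3)

ratios = []
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    v = t * np.array([0.6, 0.8])        # |v| = t, fixed direction
    num = np.linalg.norm(f(x0 + v) - f(x0) - J @ v)
    ratios.append(num / np.linalg.norm(v))
# The ratio shrinks roughly linearly in t, as differentiability predicts.
assert all(r2 < r1 for r1, r2 in zip(ratios, ratios[1:]))
```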
Intuitively, I would like to show that $\lambda = B$ is the correct choice. From $g$ being as good an approximation as any $h$, I have the following: there is $\delta > 0 $ so that for all $|x - x_0| < \delta$ I have $|f(x) - g(x)| \leq |C - f(x) + D(x-x_0)|$. Now I can use the triangle inequality, together with a result I have already proven, namely that any linear function $D$ is bounded in the following way: $|D(x-x_0)| \leq M|x-x_0|$.
In this case, I can always find $\delta > 0$ such that for all $|x-x_0| <\delta$ I have $|f(x) - g(x) | \leq |C-f(x)+D(x-x_0)| \leq |C - f(x)| + |D(x-x_0)| \leq |C - f(x)| + M|x-x_0|$. As $h$ is arbitrary, I could choose $M = \varepsilon$, since I also know that $M = \sqrt{mn}\,\max_{ij}|D_{ij}|$ works. But then I would only get that $|f(x) - g(x)| \leq |C - f(x)| + \varepsilon |x-x_0|$.
How do I get rid of the second term? Should I use continuity? Do I somehow use the linearity of $g$ now? Any help would be appreciated - thanks!
I'll be following a different path to the result. To simplify, I'll assume $x_0=0$ and $f(0)=0.$ Also I'll assume $m=1.$ So $f$ is real valued in some neighborhood of $0$ in $\mathbb R^n.$
Assume there exists a linear function $g:\mathbb R^n\to \mathbb R$ that is a best linear approximation to $f.$ That means that for any linear $h:\mathbb R^n\to \mathbb R$ there exists a neighborhood $U_h$ of $0$ such that
$$|f(x)-g(x)|\le |f(x)-h(x)| \text{ for all } x\in U_h.$$
We want to show $Df(0)=g.$ I.e.,
$$\tag 1 \frac{|f(x)-g(x)|}{|x|}\to 0$$
as $x\to 0$ through nonzero vectors.
Suppose $(1)$ fails. Then there is $\epsilon>0$ and a sequence of nonzero vectors $x_k\to 0$ such that
$$ |f(x_k)-g(x_k)|\ge \epsilon|x_k|$$
for all $k.$ It follows that for each $k,$ either i) $f(x_k)-g(x_k)\ge \epsilon|x_k|$ or ii) $f(x_k)-g(x_k)\le -\epsilon|x_k|.$ At least one of those holds for infinitely many $k;$ let's assume it's i). Instead of subsequence notation, I'll assume WLOG i) holds for all $k.$
Write $x_k=r_ku_k,$ where $r_k=|x_k|$ and $u_k= x_k/|x_k|.$ The $u_k$ are unit vectors, and since the unit sphere $S$ is compact, there exists a subsequence of $u_k$ that converges to some $u_0\in S.$ I'll continue to abuse notation and assume $u_k$ is this subsequence.
Define $L(tu_0) = t(\epsilon/2),t\in \mathbb R.$ Then extend $L$ to be linear on $\mathbb R^n$ in any way you like. Then $g+L$ is linear on $\mathbb R^n.$
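One concrete way to carry out the "extend in any way you like" step (this particular extension is my own choice, not prescribed by the argument) is $L(x) = (\epsilon/2)\langle x, u_0\rangle$, which is linear on all of $\mathbb R^n$ and satisfies $L(tu_0)=t(\epsilon/2)$ because $|u_0|=1$:

```python
import numpy as np

# A concrete linear extension of L from the line through u_0 to R^n:
# L(x) = (eps/2) * <x, u_0>.  Since |u_0| = 1, L(t*u_0) = t * eps/2.
eps = 0.5
u0 = np.array([0.6, 0.8])                   # a unit vector in R^2
L = lambda x: (eps / 2.0) * np.dot(x, u0)

t = 3.7
assert np.isclose(L(t * u0), t * eps / 2.0)              # L(t u_0) = t(eps/2)
a, b = np.array([1.0, -2.0]), np.array([0.3, 4.0])
assert np.isclose(L(2 * a + 5 * b), 2 * L(a) + 5 * L(b)) # linearity
```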
Claim: There exists $K$ such that
$$0<f(x_k)-(g(x_k)+L(x_k))< f(x_k)-g(x_k)$$
for $k>K.$
If we prove the claim, we have a contradiction, since then in every neighborhood of $0$ there exist points where $|f-(g+L)|< |f-g|,$ violating the best approximation property of $g.$
Proof of claim: We start with
$$f(x_k)-(g(x_k)+L(x_k)) = f(x_k)-g(x_k)-L(x_k).$$
Now observe
$$ -L(x_k) =-r_kL(u_k)+r_kL(u_0)-r_kL(u_0)$$ $$= r_k(L(u_0)-L(u_k))-r_k\epsilon/2 .$$
Now $L(u_0)-L(u_k)\to 0$ as $k\to \infty.$ So there exists $K$ such that $|L(u_0)-L(u_k)|<\epsilon/4$ for $k\ge K.$ It follows that for such $k,$
$$-3r_k\epsilon/4 < -L(x_k) < -r_k\epsilon/4.$$
Since $f(x_k)-g(x_k)\ge \epsilon r_k,$ adding the two displays gives, for $k\ge K,$
$$0<\epsilon r_k/4 < f(x_k)-g(x_k)-L(x_k) < f(x_k)-g(x_k)-\epsilon r_k/4 < f(x_k)-g(x_k),$$
so the claim is proved.
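The bracketing of $-L(x_k)$ can also be checked numerically. This sketch uses my own concrete choices: the extension $L(x) = (\epsilon/2)\langle x, u_0\rangle$ (one valid option) and a hypothetical sequence $u_k \to u_0$:

```python
import numpy as np

# Check that with u_k -> u_0 and L(u_0) = eps/2, eventually
#   -3 r_k eps/4 < -L(x_k) < -r_k eps/4.
eps = 1.0
u0 = np.array([1.0, 0.0])
L = lambda x: (eps / 2.0) * np.dot(x, u0)   # one extension with L(u_0) = eps/2

for k in range(10, 200):
    uk = np.array([1.0, 1.0 / k])
    uk = uk / np.linalg.norm(uk)            # unit vectors converging to u_0
    rk = 1.0 / k
    xk = rk * uk                            # x_k = r_k u_k
    assert -3 * rk * eps / 4 < -L(xk) < -rk * eps / 4
```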
I got a bit lost in the notation in your attempt - you can get away with using far fewer variables than you are. Just for clarity, I'll prove the following, where I've also translated everything to $0$:
Suppose that $f:\mathbb R^n\rightarrow\mathbb R^m$ is a function and $g:\mathbb R^n\rightarrow\mathbb R^m$ is an affine function with the property that for every other affine $h:\mathbb R^n\rightarrow\mathbb R^m$ there exists some $\delta$ such that if $|x|<\delta$ then $$|f(x)-g(x)|\leq |f(x)-h(x)|.$$ Then, $$\lim_{x\rightarrow 0}\frac{|f(x)-g(x)|}{|x|}=0.$$
And, since we're going to unravel all of the notation anyways, we may as well write the conclusion as its definition:
For every $\varepsilon > 0 $ there exists some $\delta$ such that if $|x|<\delta$ then $|f(x)-g(x)| \leq \varepsilon |x|$.
Let's prove the contrapositive - suppose there were some $\varepsilon > 0$ such that for all $\delta > 0$ there exists some $x$ with $|x|<\delta$ and $|f(x)-g(x)| > \varepsilon |x|$; equivalently, suppose there is a sequence $x_i$ approaching $0$ such that $|f(x_i)-g(x_i)| > \varepsilon|x_i|$ for every $i$. We will show that $g$ is not the best affine approximation of $f$.
Let $D$ be the set of linear transformations (not affine transformations) of operator norm exactly $\varepsilon$. This is a compact set. For each $x_i$, choose some $M_i\in D$ such that $|f(x_i)-g(x_i)-M_i(x_i)| = |f(x_i)-g(x_i)| - \varepsilon |x_i|$ - any map which sends $x_i$ to a vector of length $\varepsilon|x_i|$ parallel to the error vector $f(x_i)-g(x_i)$ suffices. Note that if $\|M-M_i\| < \varepsilon$ in the operator norm, then $|f(x_i)-g(x_i)-M(x_i)| < |f(x_i)-g(x_i)|$.
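One explicit choice of such an $M_i$ (my own construction; the answer only asserts existence) is the rank-one map $M_i(x) = \varepsilon\,\langle x, x_i/|x_i|\rangle\, e_i/|e_i|$, where $e_i = f(x_i)-g(x_i)$. It has operator norm $\varepsilon$ and sends $x_i$ to $\varepsilon|x_i|\,e_i/|e_i|$:

```python
import numpy as np

# Rank-one M_i of operator norm eps with
#   |e_i - M_i(x_i)| = |e_i| - eps*|x_i|,   where e_i = f(x_i) - g(x_i).
# The sample point and error vector below are hypothetical data
# satisfying |e_i| > eps*|x_i|.
eps = 0.5
xi = np.array([1.0, 2.0])                   # sample x_i in R^2
ei = np.array([3.0, -1.0, 2.0])             # sample error vector in R^3

Mi = eps * np.outer(ei / np.linalg.norm(ei), xi / np.linalg.norm(xi))
assert np.isclose(np.linalg.norm(Mi, 2), eps)        # operator (spectral) norm
lhs = np.linalg.norm(ei - Mi @ xi)
assert np.isclose(lhs, np.linalg.norm(ei) - eps * np.linalg.norm(xi))
```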
Let $M$ be any limit point of the sequence $M_i$; such an $M$ exists by compactness of $D$. Then there must be infinitely many $x_i$ such that $|f(x_i)-g(x_i)-M(x_i)| < |f(x_i)-g(x_i)|$, since $M$ is a limit of some subsequence of the $M_i$. Since these $x_i$ approach $0$, every neighborhood of $0$ contains a point where $g+M$ approximates $f$ strictly better than $g$; and $M\neq 0$, so $g$ is not at least as good an affine approximation as $g+M$ - as desired. Thus, if $\frac{|f(x)-g(x)|}{|x|}$ fails to converge to $0$, $g$ is not the best affine approximation of $f$.
However: I don't think there are many functions of two variables that have a best affine approximation. The trouble is this. Suppose you had a continuous function $f:\mathbb R^n\rightarrow\mathbb R$, a linear function $g:\mathbb R^n\rightarrow\mathbb R$, and some line $\ell$ through the origin on which $g$ did not exactly equal $f$ on an open set around the origin. Then any other linear function $h:\mathbb R^n\rightarrow\mathbb R$ which agreed with $g$ on $\ell$ but differed elsewhere would, in any ball around the origin, have some point where $h$ approximates $f$ better than $g$: if some point on $\ell$ fails to agree exactly with $g$, then by moving slightly in some direction, we can make $h$ either slightly greater or slightly less than $g$, and thus bring it closer to the value of $f$. I would suspect that this means a continuous function with a best linear approximation is in fact linear itself. This issue doesn't come up in one dimension, where this geometric problem doesn't arise.
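This obstruction can be seen numerically. In this sketch (my own illustration, with hypothetical choices of $f$, $g$, $h$), take $f(x,y)=x^2+y^2$ at the origin: its derivative gives $g=0$, yet the competitor $h(x,y)=cy$ beats $g$ at points arbitrarily close to $0$, namely at $(t, t^2/c)$ where $|f-h| = t^4/c^2 < t^2 + t^4/c^2 = |f-g|$:

```python
# For f(x, y) = x^2 + y^2 at the origin, g = 0 (from the derivative) is
# beaten by h(x, y) = c*y at points arbitrarily close to 0, so g is not
# "at least as good" as h on any ball around 0.
f = lambda x, y: x**2 + y**2
c = 0.1
g = lambda x, y: 0.0                        # affine map from the derivative at 0
h = lambda x, y: c * y                      # a competing linear map

for t in [1e-1, 1e-2, 1e-3, 1e-4]:          # points shrinking toward the origin
    x, y = t, t**2 / c                      # chosen so that h(x, y) = t^2
    assert abs(f(x, y) - h(x, y)) < abs(f(x, y) - g(x, y))
```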