What is the strongest possible statement of the idea that "the tangent line is the best linear approximation"?
For instance, I've just checked that if you take the best linear approximation (in the $L^2$ sense) to a sufficiently nice function $f$ on the interval $[-\varepsilon, \varepsilon]$, and then let $\varepsilon \to 0$, you get $f(0) + x f'(0)$.
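(For concreteness, here is the sort of numerical check I mean -- only a minimal sketch, where the test function $f = \exp$ and the quadrature via `scipy` are my own choices:)

```python
# Minimal sketch of the check: on [-eps, eps] the basis {1, x} is
# L^2-orthogonal, so the best-fit coefficients decouple.
# Test function f = exp, so f(0) = f'(0) = 1.
import numpy as np
from scipy.integrate import quad

f = np.exp

for eps in [1.0, 0.1, 0.01, 0.001]:
    a = quad(f, -eps, eps)[0] / (2 * eps)                          # -> f(0)
    b = quad(lambda x: x * f(x), -eps, eps)[0] / (2 * eps**3 / 3)  # -> f'(0)
    print(f"eps = {eps:6.3f}:  intercept = {a:.8f},  slope = {b:.8f}")
```

Both coefficients visibly approach $1 = f(0) = f'(0)$ as $\varepsilon \to 0$, matching the claim.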
Surely we could make this stronger -- I imagine the analogous statement should hold for, say, the $L^1$ norm as well, or for most reasonable norms. Can we go further, though?
Question: What is the strongest precise definition we can give the word "best" so that we have a statement of the form "the tangent line is the best linear approximation to a differentiable function"? (Feel free to replace "differentiable" with, say, $C^2$ or something if it makes for a more interesting answer.)
(Note: I'm aware of similar-sounding questions here, such as *In what sense is the derivative the "best" linear approximation?*, but the answers there don't answer my question.)
If $f$ is differentiable at $0$, then the same statement holds with uniform approximation in place of $L_2$.
Proof: Let $T(x) = f(0) + xf'(0)$. By the definition of differentiability we have $$ f(x) - T(x) = o(|x|), $$ and thus $$ \sup_{|x|\le\varepsilon} |f(x) - T(x)| = o(\varepsilon).$$ The best uniform approximation on $[-\varepsilon, \varepsilon]$, say $g_\varepsilon$, can only do better, so for any choice of points $x_\varepsilon \in [-\varepsilon, \varepsilon]$ we have $$ |f(x_\varepsilon) - g_\varepsilon(x_\varepsilon)| \le \sup_{|x|\le \varepsilon} |f(x) - g_\varepsilon(x)| \le \sup_{|x|\le \varepsilon} |f(x) - T(x)| = o(\varepsilon).$$ Taking $x_\varepsilon = 0$ gives $$g_\varepsilon(0) = f(0) + o(\varepsilon),$$ and since $g_\varepsilon$ is affine, its slope satisfies $$ g_\varepsilon' = \frac{g_\varepsilon(\varepsilon) - g_\varepsilon(-\varepsilon)}{2\varepsilon} = \frac{f(\varepsilon) - f(-\varepsilon)}{2\varepsilon} + o(1) \to f'(0).$$
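(To see this numerically -- again only a sketch, where the minimax fit is computed by minimizing the maximum error over a fine grid with Nelder-Mead; $f = \exp$, the grid, and the optimizer are my choices, not part of the proof:)

```python
# Sketch: best uniform (minimax) linear approximation on [-eps, eps],
# found by minimizing the maximum absolute error over a fine grid.
import numpy as np
from scipy.optimize import minimize

f = np.exp  # test function: f(0) = f'(0) = 1

def best_uniform_line(eps, n=2001):
    x = np.linspace(-eps, eps, n)
    y = f(x)
    err = lambda ab: np.max(np.abs(y - (ab[0] + ab[1] * x)))
    # Start from the secant line, a natural initial guess.
    res = minimize(err, x0=[y.mean(), (y[-1] - y[0]) / (2 * eps)],
                   method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12})
    return res.x  # coefficients (a, b) of g_eps(x) = a + b x

for eps in [1.0, 0.1, 0.01]:
    a, b = best_uniform_line(eps)
    print(f"eps = {eps:5.2f}:  g_eps(0) = {a:.6f},  slope = {b:.6f}")
```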
For the averaged $L_p$ norm, $1\le p < \infty$:
Fix some $p\in [1,\infty)$. For any measurable $\phi$, denote its averaged $L_p$ norm on $[-\varepsilon, \varepsilon]$ by $$N_\varepsilon \phi = \sqrt[p]{\frac{1}{2\varepsilon} \int_{-\varepsilon}^{\varepsilon} |\phi(x)|^p\,dx}.$$ Let $g_\varepsilon$ be an $L_p$-best linear approximation on $[-\varepsilon, \varepsilon]$. Then, since $N_\varepsilon \phi \le \sup_{|x|\le\varepsilon}|\phi(x)|$, we have $$ N_\varepsilon (f - g_\varepsilon) \le N_\varepsilon (f - T) = o(\varepsilon), $$ and by the triangle inequality $$ N_\varepsilon (g_\varepsilon - T) \le N_\varepsilon (f - g_\varepsilon) + N_\varepsilon (f - T) = o(\varepsilon). $$ To conclude, write $g_\varepsilon(x) - T(x) = a_\varepsilon + b_\varepsilon x$. Substituting $x = \varepsilon t$ and using the equivalence of norms on the two-dimensional space of affine functions on $[-1,1]$, there is a constant $c_p > 0$ with $N_\varepsilon(g_\varepsilon - T) \ge c_p\,(|a_\varepsilon| + \varepsilon |b_\varepsilon|)$, so $a_\varepsilon = o(\varepsilon)$ and $b_\varepsilon = o(1)$; that is, $g_\varepsilon(0) \to f(0)$ and $g_\varepsilon' \to f'(0)$ as before.
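(The same can be checked numerically for the averaged $L_p$ norm -- once more a sketch, with $p$, $f = \exp$, the Riemann-sum quadrature, and the optimizer as my choices:)

```python
# Sketch: L_p-best linear approximation on [-eps, eps], with the
# averaged L_p norm N_eps approximated by a Riemann sum on a grid.
import numpy as np
from scipy.optimize import minimize

f = np.exp  # test function: f(0) = f'(0) = 1
p = 1.5     # any 1 <= p < infinity

def best_lp_line(eps, n=4001):
    x = np.linspace(-eps, eps, n)
    y = f(x)
    # Minimizing the mean of |error|^p is equivalent to minimizing
    # N_eps itself, since the p-th root is monotone.
    err = lambda ab: np.mean(np.abs(y - (ab[0] + ab[1] * x)) ** p)
    res = minimize(err, x0=[y.mean(), (y[-1] - y[0]) / (2 * eps)],
                   method="Nelder-Mead")
    return res.x  # coefficients (a, b) of g_eps(x) = a + b x

for eps in [1.0, 0.1, 0.01]:
    a, b = best_lp_line(eps)
    print(f"eps = {eps:5.2f}:  g_eps(0) = {a:.6f},  slope = {b:.6f}")
```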