Understanding the derivative as a linear transformation
Solution 1:
The point is that for a function $f : \mathbb{R} \to \mathbb{R}$, $f'(a)$ defines a linear transformation, just life $Df({\bf a})$ does for a function $f : \mathbb{R}^n \to \mathbb{R}^m$.
In single variable calculus, we are taught that the derivative of $f(x)$ at a point $x = a$ is a real number $f'(a)$ which represents the slope of the tangent line to the graph of $f(x)$ at the point $x = a$. The equation of this tangent line is $y = f'(a)(x-a) + f(a)$; this is the best linear approximation of $f(x)$ near $x = a$, not the derivative itself.
If we do the change of variables $x^* = x - a$, $y^* = y - f(a)$, the tangent line becomes $y^* = f'(a)x^*$; this is a linear function, which is just a linear transformation $\mathbb{R} \to \mathbb{R}$, and the standard matrix of this linear transformation is the $1\times 1$ matrix $[f'(a)]$.
In higher dimensions, we start with $f : \mathbb{R}^n \to \mathbb{R}^m$ and at a point ${\bf a} \in \mathbb{R}^n$ we have the derivative $Df({\bf a})$ which is an $m\times n$ matrix $Df({\bf a}) = \left[\frac{\partial f_i}{\partial x_j}({\bf a})\right]$ which is sometimes called the Jacobian of $f$ at ${\bf a}$. Then the best linear approximation of $f({\bf x})$ near ${\bf x} = {\bf a}$ is ${\bf y} = Df({\bf a})({\bf x}-{\bf a}) + f({\bf a})$.
If we do the change of variables ${\bf x}^* = {\bf x} - {\bf a}$, ${\bf y}^* = {\bf y} - f({\bf a})$, the tangent line becomes ${\bf y}^* = Df({\bf a}){\bf x}^*$; this is a linear transformation $\mathbb{R}^n \to \mathbb{R}^m$, and the standard matrix of this linear transformation is the $m\times n$ matrix $Df({\bf a})$.
So the derivative in single variable calculus is just a special case of the derivative in multivariable calculus; just set $m = n = 1$.
As for your question, 'why can't we use a number for the best linear approximation for a function $\mathbb{R}^n \to \mathbb{R}^m$?', note that the approximating function must be $\mathbb{R}^n \to \mathbb{R}^m$, and because it is linear, it must be of the form ${\bf y} = A{\bf x} + {\bf b}$ where $A$ an $m \times n$ matrix and ${\bf b} \in \mathbb{R}^m$. By enforcing the condition that the linear approximation must agree with the function at ${\bf x} = {\bf a}$, we find that the linear approximation must be of the form ${\bf y} = A({\bf x} - {\bf a}) + f({\bf a})$. So the only thing left to determine is the $m\times n$ matrix $A$, not a single number as in single variable calculus.
Solution 2:
I have little to add to Michael's excellent answer. However, Dieudonne said it best: this is the introduction to his chapter on differentiation in Modern Analysis Chapter VIII.
The subject matter of this Chapter is nothing else but the elementary theorems of Calculus, which however are presented in a way which will probably be new to most students. That presentation, which throughout adheres strictly to our general "geometric" outlook on Analysis, aims at keeping as close as possible to the fundamental idea of Calculus, namely the "local" approximation of functions by linear functions. In the classical teaching of Calculus, the idea is immediately obscured by the accidental fact that, on a one-dimensional vector space, there is a one-to- one correspondence between linear forms and numbers, and therefore the derivative at a point is defined as a number instead of a linear form. This slavish subservience to the shibboleth of numerical interpretation at any cost becomes much worse when dealing with functions of several variables...
In other words, the confusion you face follows from thinking of the derivative wrongly in calculus I. "Wrong" in the sense that the idea does not generalize to higher dimensions directly. The derivative of a function from $\mathbb{R}^n \rightarrow \mathbb{R}^m$ is not another function from $\mathbb{R}^n \rightarrow \mathbb{R}^m$. Instead, it's a linear transformation, or if you prefer the Jacobian viewpoint, a matrix of functions.