Definition of the gradient for non-Cartesian coordinates
Solution 1:
It turns out that there are two different but related notions of differentiation for a function $f:\mathbb R^n\to\mathbb R$: the total derivative $df$ and the gradient $\nabla f$.
- The total derivative is a covector ("dual vector", "linear form") and does not depend on the choice of a metric ("measure of length").
- The gradient is an ordinary vector derived from the total derivative, but it depends on a metric. That's why it looks a bit funny in different coordinate systems.
The definition of the total derivative answers the following question: given a vector $\vec v$, what is the slope of the function $f$ in the direction of $\vec v$? The answer is, of course,
$$ df_{x}(\vec v) = \lim_{t\to0} \frac{f(x+t\vec v)-f(x)}{t}$$
I.e. you start at the point $x$, walk a teensy bit in the direction of $\vec v$, and take note of the ratio $\Delta f/\Delta t$.
Note that the total derivative is a linear map $\mathbb R^n \to \mathbb R$, not a vector in $\mathbb R^n$. Given a vector, it tells you some number. In coordinates, this is usually written as
$$ df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial z}dz $$
where $dx,dy,dz$ are the total derivatives of the coordinate functions, for instance $dx(v_x,v_y,v_z) := v_x$. This formula looks the same in any coordinate system.
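As a quick sanity check, here is a small sympy sketch (the test function, point, and direction are arbitrary choices of mine) confirming that this coordinate formula agrees with the limit definition above:

```python
import sympy as sp

x, y, z, t = sp.symbols('x y z t')
f = x**2 * y + sp.sin(z)                      # an arbitrary test function

# Total derivative in coordinates: df = f_x dx + f_y dy + f_z dz
partials = [sp.diff(f, v) for v in (x, y, z)]

point = {x: 1, y: 2, z: 0}                    # base point
vec = (3, -1, 2)                              # direction vector v

# df_x(v) via the coordinate formula
df_v = sum(p.subs(point) * c for p, c in zip(partials, vec))

# df_x(v) via the limit definition lim_{t->0} (f(x + t v) - f(x)) / t
shifted = f.subs({x: 1 + 3*t, y: 2 - t, z: 2*t})
limit_v = sp.limit((shifted - f.subs(point)) / t, t, 0)

print(df_v, limit_v)                          # both evaluate to 13
```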
In contrast, the gradient answers the following question: what is the direction of steepest ascent of the function? That is, which vector $\vec v$ of unit length maximizes the value $df(\vec v)$? As you can see, this definition crucially depends on the fact that you can measure the length of a vector. The gradient is then defined as
$$ \nabla f = df(\vec v_{max})\cdot\vec v_{max} $$
i.e. it gives both the direction and the magnitude of the steepest change.
This can also be expressed as
$$ \langle \nabla f, \vec v \rangle = df(\vec v) \quad\forall \vec v\in\mathbb R^n.$$
In other words, the scalar product $\langle\cdot,\cdot\rangle$ is used to convert the covector $df$ into a vector $\nabla f$. This also means that the formula for the gradient looks very different in coordinate systems other than Cartesian ones. If the scalar product is changed (say, to $\langle\vec a,\vec b\rangle := a_xb_x + a_yb_y + 4a_zb_z$), then the direction of steepest ascent also changes. (Exercise: Why?)
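To illustrate the exercise without working it out by hand, here is a small numerical sketch (the partial derivatives are made-up values of mine): writing the modified scalar product as $\langle\vec a,\vec b\rangle = \vec a^T G\vec b$ with $G = \mathrm{diag}(1,1,4)$, the defining relation $\langle\nabla f,\vec v\rangle = df(\vec v)$ forces $G\nabla f$ to equal the vector of partials, and the resulting direction differs from the Euclidean one:

```python
import numpy as np

# Modified scalar product <a, b> = a_x b_x + a_y b_y + 4 a_z b_z, i.e. a^T G b
G = np.diag([1.0, 1.0, 4.0])

df = np.array([1.0, 0.0, 2.0])            # made-up partials of f at some point

grad_euclidean = df                       # gradient w.r.t. the standard product
grad_modified = np.linalg.solve(G, df)    # solves G @ grad = df

unit = lambda v: v / np.linalg.norm(v)    # normalize to compare directions
print(unit(grad_euclidean))               # [0.447  0.     0.894]
print(unit(grad_modified))                # [0.894  0.     0.447] -- a different direction
```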
Solution 2:
For your first question, whether a function is "scalar-valued" or not doesn't depend on the coordinate system of the domain. Any function that evaluates to a value in the underlying field, in this case $\mathbb{R}$, is scalar-valued. Unless explicitly stated otherwise though, people usually assume Cartesian coordinates with the standard basis vectors when they are referencing $\mathbb{R}^n$.
Your second question takes a little more work to answer, but the short answer is: yes, as it is typically defined, the gradient is specified in terms of Cartesian coordinates; there is, however, a much better approach. I have recently been working on material that is directly related to this. See, for example, this question I recently posed. The point there was the following:
If $f:X\subset \mathbb{R}^n \rightarrow \mathbb{R}$, then the derivative of $f$ at $x_0$, $df_{x_0}$, is a linear function $df_{x_0}:\mathbb{R}^n \rightarrow \mathbb{R}$. By the Riesz representation theorem there exists a unique vector in $\mathbb{R}^n$, which we denote by $\nabla f(x_0)$, that satisfies
$$ df_{x_0}(v) = g(v, \nabla f(x_0)) $$ for every $v \in \mathbb{R}^n$, where $g$ is an inner product on $\mathbb{R}^n$. Note that this definition is free of coordinates but does require the existence of an inner product (which may or may not be the standard one). See, for example, [AMANN, p. 160] or [FRANKEL, p. 46] for a discussion of this perspective.
Now, it will turn out that if you do use standard Cartesian coordinate vectors then you can recover the "typical" definition of the gradient from this one. To see this though, and to see where the expression for the gradient in spherical coordinates that you provided in your question comes from, requires us to dig deeper.
Now, it can be shown that
$$ \nabla f(x_0) = (g^{1k} \partial_k f(x_0), \dots, g^{nk} \partial_k f(x_0)) $$
where $g^{ij}$ denotes the $i,j$ entry of the inverse of the matrix $G = [g_{ij}]$ (with summation over the repeated index $k$). I'll refer you again to my previous question for the details of this statement. So, this expression gives us a concrete way to actually calculate the gradient, but in order to do so we will need to figure out how to compute the matrix $G$.
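In code, this formula is just $G^{-1}$ applied to the vector of partial derivatives. Here is a minimal sympy sketch (the helper `gradient_components` is my own name, not something from the references):

```python
import sympy as sp

def gradient_components(f, coords, G):
    """Components of grad f in the given coordinates: G^{-1} times the partials."""
    partials = sp.Matrix([sp.diff(f, c) for c in coords])
    return sp.simplify(G.inv() * partials)

# With the identity metric (Cartesian coordinates) this recovers the usual gradient:
x, y = sp.symbols('x y')
print(gradient_components(x**2 + x*y, (x, y), sp.eye(2)))  # Matrix([[2*x + y], [x]])
```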
For a (tractable) example, let us consider polar coordinates. They are related to Cartesian coordinates by the well-known formulae $x = r \cos (\theta)$ and $y = r \sin (\theta)$. It can be shown that the matrix $G$ is determined by the relation $G = J^TJ$, where $J$ is the Jacobian of the transformation in question. See [KAY, p. 54] for a reference to this fact.
In the case of polar coordinates, the transformation is given by $$ T(r,\theta)= (r \cos (\theta), r \sin (\theta)) $$
The Jacobian of the transformation then is
$$ J = \begin{pmatrix} \cos (\theta) & -r \sin (\theta) \\ \sin (\theta) & r \cos(\theta) \end{pmatrix} $$
After working through the details we find then that
$$ G = J^TJ = \begin{pmatrix} 1 & 0 \\ 0 & r^2 \end{pmatrix} $$
Therefore,
$$ G^{-1} = \begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{r^2} \end{pmatrix} $$
From this matrix we can then read off the $g^{ij}$ components, from which it follows that
$$ \nabla f(x_0) = (\partial_r f(x_0), \frac{1}{r^2} \partial_{\theta} f(x_0)) $$
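If you want to double-check this, the whole computation can be reproduced symbolically; here is a sketch in sympy (assuming $r > 0$) that recovers $J$, $G$, $G^{-1}$, and these gradient components:

```python
import sympy as sp

r, theta = sp.symbols('r theta', positive=True)

# The transformation T(r, theta) and its Jacobian
T = sp.Matrix([r*sp.cos(theta), r*sp.sin(theta)])
J = T.jacobian([r, theta])

G = sp.simplify(J.T * J)
print(G)          # Matrix([[1, 0], [0, r**2]])
print(G.inv())    # Matrix([[1, 0], [0, r**(-2)]])

# Gradient components of a generic f(r, theta): (f_r, f_theta / r^2)
f = sp.Function('f')(r, theta)
print(G.inv() * sp.Matrix([sp.diff(f, r), sp.diff(f, theta)]))
```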
But we are still not done: this expression does not agree with the one you will usually encounter, which is
$$ \nabla f(x_0) = (\partial_r f(x_0), \frac{1}{r} \partial_{\theta} f(x_0)) $$
where the second coordinate differs from ours by a factor of $r$. So, what's going on here? First, we note that since we are working in $r$-$\theta$ coordinates, the gradient vector is expressed relative to the $r$-$\theta$ basis. Our component-wise notation is obscuring this fact. So what we actually have is
$$ \nabla f(x_0) = \partial_r f(x_0) e_r + \frac{1}{r^2} \partial_{\theta}f(x_0)e_{\theta} $$
where $e_r$ and $e_{\theta}$ denote the $r$-$\theta$ basis vectors. So, what are they? Well, you can always use geometry to figure this out, but since I'm really lousy at geometry, I like to think of them as being defined analytically as tangent vectors. See [KOKS, p. 298] for a discussion of this perspective. To determine them we just differentiate our transformation $T$ with respect to $r$ and $\theta$ respectively. Thus
$$ e_r = \partial_r T = (\cos (\theta), \sin(\theta)) $$
and
$$ e_{\theta} = \partial_{\theta}T = r(-\sin( \theta), \cos (\theta)) $$
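These tangent vectors, and the norms used in the next step, are easy to verify in sympy (a sketch, again assuming $r > 0$):

```python
import sympy as sp

r, theta = sp.symbols('r theta', positive=True)
T = sp.Matrix([r*sp.cos(theta), r*sp.sin(theta)])

e_r = T.diff(r)           # (cos(theta), sin(theta))
e_theta = T.diff(theta)   # (-r*sin(theta), r*cos(theta))

# |e_r| = 1 but |e_theta| = r, so e_theta is not a unit vector
print(sp.simplify(sp.sqrt(e_r.dot(e_r))))          # 1
print(sp.simplify(sp.sqrt(e_theta.dot(e_theta))))  # r
```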
Note though that while $e_r = \hat{e_r}$ is a unit vector, $e_{\theta}$ is not. Using some algebra we see that $e_{\theta} = |r|\hat{e_{\theta}} = r\hat{e_{\theta}}$, since $r > 0$. Therefore, the gradient with respect to the unit basis vectors is given by
$$ \nabla f(x_0) = \partial_r f(x_0) \hat{e_r} + \frac{1}{r} \partial_{\theta}f(x_0)\hat{e_\theta} $$
and we thus have agreement with the common expression for the gradient in polar coordinates.
References:
[AMANN] Amann and Escher, Analysis II
[FRANKEL] Frankel, The Geometry of Physics
[KAY] Kay, Schaum's Outline of Tensor Calculus
[KOKS] Koks, Explorations in Mathematical Physics