Trouble with gradient intuition
Solution 1:
At each point in the $xy$-plane the function you have gives you a value. Normally this value is visualized as altitude, that is, a $z$-value, but it can just as easily be anything else. Examples that come to mind are grayscale tones (if the $xy$-plane is a black-and-white picture), density (if the $xy$-plane is a plate made from varying materials), or, maybe the most important one, potential in some force field (often gravitational or electric fields). Of course, in practice, these examples rarely give you a function that is nice to work with, but in textbook examples it usually works out.
Now, imagine that you are walking around in the $xy$-plane, sniffing or measuring that value, whatever it represents. At some point in time you're standing at the point $(5, 3)$, and the gradient of your function at that point is $(-1, 2)$. That means if you turn so that you're facing the direction $(-1, 2)$ at the point $(5, 3)$ (that is, you're looking directly at the point $(4, 5)$), then that is the direction in which (at least for the first gazillionth of a meter you travel) your value will increase the fastest, out of all the directions you can pick.
I can do even better. I can tell you how much it will grow over that gazillionth of a meter. Since the magnitude of the gradient is $\sqrt 5$, your function value will grow by approximately $\dfrac{\sqrt{5}}{\text{gazillion}}$ of whatever unit it's measured in. (Approximately, because even over that short distance the gradient might change a little bit. Studying how these changes affect the end result is what partial differential equations, and especially using computers to solve them, are all about. But that's another story entirely.)
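To see the $\sqrt 5$ claim in numbers, here is a minimal sketch. The function `f` is a made-up example (it is not from the question), chosen only so that its gradient at $(5, 3)$ is $(-1, 2)$; the step size `1e-6` stands in for the "gazillionth of a meter".

```python
import math

def f(x, y):
    # Hypothetical example function, chosen so that its gradient at (5, 3) is (-1, 2).
    return -x + 2 * y + 0.1 * ((x - 5) ** 2 + (y - 3) ** 2)

point = (5.0, 3.0)
gradient = (-1.0, 2.0)
magnitude = math.hypot(*gradient)  # sqrt(5)

# Unit vector pointing in the gradient direction.
unit = (gradient[0] / magnitude, gradient[1] / magnitude)

step = 1e-6  # stands in for the "gazillionth of a meter"
moved = (point[0] + step * unit[0], point[1] + step * unit[1])

actual_growth = f(*moved) - f(*point)   # what you actually measure
predicted_growth = magnitude * step     # sqrt(5) / gazillion

print(actual_growth, predicted_growth)
```

The two printed numbers should agree to many digits; the small leftover difference comes from the gradient changing along the step.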
Traveling around in the $xy$-plane is something we mathematicians do all the time. When you finally get some intuition, it's much easier (and more fun) to play around with that intuition if you imagine yourself standing at a point in the plane and walking around, rather than just imagining a point moving around.
Also, as for why the gradient gives you a maximal slope, say again that you have a gradient of $(-1, 2)$ at the point $(5, 3)$. That means if you go from that point, and move directly parallel to the $x$-axis (for a very short distance), in the positive direction (towards $(6, 3)$), then your function value will decrease at the speed of $1$ per meter travelled. This is, after all, how the $-1$ in $(-1, 2)$ came to be in the first place. So if you want the value to increase, you're better off moving in the negative $x$-direction.
Likewise, if you move in the positive $y$-direction (towards $(5, 4)$), your function value will increase by $2$ per meter travelled. So the function grows faster in the positive $y$-direction than in the negative $x$-direction. Both of the directions $(-1, 0)$ and $(0, 1)$ give you growth, but the fastest growth comes from a weighted mix of the two, namely the direction $(-1, 2)$: twice as much in the $y$-direction because the function grows twice as fast that way as in the negative $x$-direction.
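If you'd rather check this than take my word for it, you can try every direction and see which one wins. The sketch below assumes only the gradient $(-1, 2)$ from the example; the helper name `directional_rate` and the 0.1-degree grid of directions are my own choices for illustration.

```python
import math

gradient = (-1.0, 2.0)  # gradient at (5, 3), taken from the example in the text

def directional_rate(angle):
    # First-order rate of change per meter when walking along
    # the unit vector (cos(angle), sin(angle)).
    return gradient[0] * math.cos(angle) + gradient[1] * math.sin(angle)

# Try 3600 evenly spaced directions, 0.1 degrees apart.
angles = [math.radians(k / 10) for k in range(3600)]
best_angle = max(angles, key=directional_rate)

# Angle of the gradient vector itself, for comparison.
gradient_angle = math.atan2(gradient[1], gradient[0])

print(directional_rate(math.pi))      # walking along (-1, 0): about 1 per meter
print(directional_rate(math.pi / 2))  # walking along (0, 1): about 2 per meter
print(directional_rate(best_angle))   # best direction on the grid: close to sqrt(5)
print(math.degrees(best_angle), math.degrees(gradient_angle))
```

The best direction found by brute force should land within a grid step of the gradient's own angle, and its rate should be just under $\sqrt 5$, which beats both $1$ and $2$.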
Solution 2:
Here's another viewpoint. From single variable calculus, we know that if a function $f$ is differentiable at $x$, then \begin{equation} f(x + \Delta x) \approx f(x) + f'(x) \Delta x \end{equation} and the approximation is good when $\Delta x$ is small.
The situation in multivariable calculus is analogous. Suppose $f:\mathbb R^2 \to \mathbb R$ is differentiable at a point $x = (x_1,x_2) \in \mathbb R^2$. Then \begin{equation} f(x+\Delta x) \approx f(x) + \langle \nabla f(x), \Delta x \rangle \end{equation} and the approximation is good when the vector $\Delta x$ is small.
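A quick numerical sanity check of this approximation, as a sketch: the function `f` below is a hypothetical smooth example (not from the question), with its gradient `grad_f` worked out by hand, and the step direction $(0.6, -0.8)$ is arbitrary.

```python
import math

def f(x, y):
    # Hypothetical smooth function, used only for illustration.
    return math.sin(x) * y ** 2

def grad_f(x, y):
    # Gradient of the function above, computed by hand.
    return (math.cos(x) * y ** 2, 2 * math.sin(x) * y)

def linear_approx(x, y, dx, dy):
    # f(x) + <grad f(x), delta x>
    gx, gy = grad_f(x, y)
    return f(x, y) + gx * dx + gy * dy

x, y = 1.0, 2.0
errors = []
for scale in (1e-1, 1e-2, 1e-3):
    dx, dy = 0.6 * scale, -0.8 * scale  # step of length `scale`
    exact = f(x + dx, y + dy)
    approx = linear_approx(x, y, dx, dy)
    errors.append(abs(exact - approx))
    print(scale, exact, approx, errors[-1])
```

Each time the step shrinks by a factor of 10, the error should shrink by roughly a factor of 100, which is what "the approximation is good when $\Delta x$ is small" looks like in practice.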
We could ask, how should we pick $\Delta x$ so that $f(x + \Delta x)$ is as large as possible? We certainly don't want $\Delta x$ to point in the opposite direction to $\nabla f(x)$, because then $\langle \nabla f(x), \Delta x \rangle$ will be negative -- the value of $f$ will decrease! Nor do we want $\Delta x$ to be orthogonal to $\nabla f(x)$, because then $\langle \nabla f(x), \Delta x \rangle = 0$, which doesn't seem to help $f$ get any larger. We want to pick $\Delta x$ to be in the same direction as $\nabla f(x)$.
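To make that last step precise, one way (a sketch, under the added assumptions that the step length is fixed at $\|\Delta x\| = \varepsilon$ and that $\nabla f(x) \neq 0$) is to write the inner product in terms of the angle $\theta$ between $\nabla f(x)$ and $\Delta x$: \begin{equation} \langle \nabla f(x), \Delta x \rangle = \|\nabla f(x)\| \, \|\Delta x\| \cos\theta = \varepsilon \, \|\nabla f(x)\| \cos\theta \le \varepsilon \, \|\nabla f(x)\| \end{equation} with equality exactly when $\cos\theta = 1$, that is, when $\Delta x = \varepsilon \, \nabla f(x) / \|\nabla f(x)\|$. So among all steps of length $\varepsilon$, the one along the gradient gives the largest first-order increase, and that increase is $\varepsilon \, \|\nabla f(x)\|$.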
Solution 3:
This graph tries to give some intuition:
1) Vector A is the gradient
2) z=f(x,y) is not shown in the graph
3) Don't miss the point that both angles (theta) are the same!
4) The point at the tip of vector B is the result of moving from (1,2) in the x and y directions while keeping the proportion given by the gradient vector. So, if A gives a 2/1 ratio (y against x), then you can remove two units of x (dz/dx=1) and add one unit of y (dz/dy=2), and z will remain the same.
5) The magnitude of vector B is greater than the magnitude of vector A. So, if you take a vector in the B direction but with the magnitude of vector A (that is, shorter than B), then the positive impact on z due to the change in y will be smaller than the negative impact on z due to the change in x (the angle formed by the changes in x and y will be less than theta).
6) Finally, both A' and A'' have the same magnitude as A, but in both cases the trade-off between x and y causes z to go down.