Why must the gradient vector always be directed in an increasing direction?
Intuitively, $f(x + \Delta x) \approx f(x) + \langle \nabla f(x), \Delta x \rangle$. (I'm using the convention that $\nabla f(x)$ is a column vector.) So if $\Delta x = \epsilon \nabla f(x)$ (here $\epsilon > 0$ is tiny), then \begin{align*} f(x + \Delta x) & \approx f(x) + \epsilon \langle \nabla f(x), \nabla f(x) \rangle \\ &= f(x) + \epsilon \| \nabla f(x) \|^2 \\ &\geq f(x). \end{align*}
So when we move a bit in the direction of $\nabla f(x)$, the value of $f$ increases (strictly so whenever $\nabla f(x) \neq 0$).
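A quick numerical sanity check of this estimate, as a minimal sketch: the function `f`, its hand-computed gradient `grad_f`, the base point and the step size `eps` below are all made-up illustrations, not part of the argument above.

```python
# Hypothetical smooth test function; any differentiable f would do.
def f(x, y):
    return x**2 + 3.0 * y - x * y


def grad_f(x, y):
    # Analytic gradient of the f defined above.
    return (2.0 * x - y, 3.0 - x)


def actual_change(x, y, eps):
    # f(x + eps * grad) - f(x): the true change after a small gradient step.
    gx, gy = grad_f(x, y)
    return f(x + eps * gx, y + eps * gy) - f(x, y)


def predicted_change(x, y, eps):
    # First-order prediction: eps * ||grad f(x)||^2.
    gx, gy = grad_f(x, y)
    return eps * (gx * gx + gy * gy)


x0, y0, eps = 2.0, 1.0, 1e-3
print("actual change:   ", actual_change(x0, y0, eps))
print("predicted change:", predicted_change(x0, y0, eps))
```

The two printed numbers should agree up to a term of order $\epsilon^2$, and both should be positive.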
It's not obvious:
Consider the function $$f(x,y):=\cases{0&$\bigl((x,y)=(0,0)\bigr)$,\cr x+y-{4|xy|^{4/3}\over x^2+y^2}&(else) .\cr}$$ Then $f$ is continuous at $(0,0)$ and $\nabla f(0,0)=(1,1)$, but $$f(t,t)-f(0,0)=2t-{4|t|^{8/3}\over 2t^2}=2|t|^{2/3}\bigl(|t|^{1/3}{\rm sgn(t)}-1\bigr)<0\qquad(0<|t|<1)\ .$$ This shows that $f$ is actually decreasing in the direction of the gradient.
Now the considered $f$ is not differentiable at $(0,0)$, and the gradient defined via partial derivatives exists only by coincidence. For any $f$ which is actually differentiable at $(0,0)$ one has $$f(x,y)-f(0,0)=\nabla f(0,0)\cdot (x,y)+o(r)\qquad(r:=\sqrt{x^2+y^2}\to0)\ .$$ Now, if $\nabla f(0,0)=(a,b)\ne(0,0)$ and you choose $(x,y):=(ta,tb)$ with $t>0$ then $$f(ta,tb)-f(0,0)=t(a^2+b^2)+o(t)=t (a^2+b^2)(1+o(1))\qquad(t\to0)\ ;$$ and therefore $f(ta,tb)-f(0,0)$ is $>0$ for sufficiently small $t>0$.
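The counterexample can also be checked numerically. The sketch below evaluates the same $f$; the helper names `partial_x_at_origin` and `partial_y_at_origin` and the sample values of $t$ are my own choices for illustration.

```python
def f(x, y):
    # The counterexample: f(0,0) = 0, else x + y - 4|xy|^(4/3) / (x^2 + y^2).
    if x == 0.0 and y == 0.0:
        return 0.0
    return x + y - 4.0 * abs(x * y) ** (4.0 / 3.0) / (x * x + y * y)


def partial_x_at_origin(h):
    # Difference quotient along the x-axis, where f(h, 0) = h.
    return (f(h, 0.0) - f(0.0, 0.0)) / h


def partial_y_at_origin(h):
    # Difference quotient along the y-axis, where f(0, h) = h.
    return (f(0.0, h) - f(0.0, 0.0)) / h


print("partials at origin:", partial_x_at_origin(1e-6), partial_y_at_origin(1e-6))

# Move from the origin along the "gradient" direction (1, 1).
for t in (1e-1, 1e-2, 1e-3):
    print(t, f(t, t) - f(0.0, 0.0))
```

Both difference quotients come out as $1$, yet every printed difference $f(t,t)-f(0,0)$ is negative, which matches the formula $2|t|^{2/3}\bigl(|t|^{1/3}-1\bigr)$ for $0<t<1$.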
I prefer to explain this in a slightly different way:
Actually, we define the gradient so that it always points in the direction of maximum increase! Take a look at the following:
Consider a function $f(x,y)$; then its total differential is:
$df(x,y)=\frac{\partial f}{\partial x}dx+\frac{\partial f}{\partial y}dy=\left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)\cdot\left(dx,dy\right)=\left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)\cdot\vec{dr}=\left\Vert \left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)\right\Vert \left\Vert \vec{dr}\right\Vert \cos\alpha$
where $\alpha$ is the angle between the two vectors. If, for simplicity, we take $\left\Vert \vec{dr}\right\Vert =1$, we finally get:
$df(x,y)=\left\Vert \left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)\right\Vert \cos\alpha$
Because the cosine is never greater than one, the norm $\left\Vert \left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)\right\Vert$ is the largest possible increase of the function for a unit step, and it is attained at $\alpha=0$. So if we take the vector $\left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)$, whose length is this norm, and call it the gradient, then this vector points in the direction of maximum possible increase of our function $f(x,y)$.
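To see the $\cos\alpha$ argument in numbers, here is a small sketch that scans unit directions and picks the one with the largest directional derivative. The gradient value `(3.0, 4.0)`, the grid size `n` and the function names are assumptions chosen only for this example.

```python
import math


def directional_derivative(grad, theta):
    # Dot product of grad with the unit vector (cos theta, sin theta).
    return grad[0] * math.cos(theta) + grad[1] * math.sin(theta)


def best_direction(grad, n=3600):
    # Brute-force scan over n equally spaced angles in [0, 2*pi).
    angles = [2.0 * math.pi * k / n for k in range(n)]
    return max(angles, key=lambda th: directional_derivative(grad, th))


grad = (3.0, 4.0)  # hypothetical gradient at some point
theta_best = best_direction(grad)
theta_grad = math.atan2(grad[1], grad[0])  # angle of the gradient itself

print("best angle found: ", theta_best)
print("angle of gradient:", theta_grad)
print("max derivative:   ", directional_derivative(grad, theta_best))
print("norm of gradient: ", math.hypot(grad[0], grad[1]))
```

Up to the grid resolution, the best angle coincides with the angle of the gradient, and the maximal directional derivative equals the norm of the gradient.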
Intuitively:
If the function is decreasing in one variable, then the partial derivative is negative, so the component of the gradient for that variable points in the negative direction, which is the direction of increasing function value.
If the function is increasing in one variable, then the partial derivative is positive, so the component of the gradient for that variable points in the positive direction, which is again the direction of increasing function value.
=> No matter what the function's profile looks like, the gradient, by definition, points in an increasing direction.