Solution 1:

The initial formula can be thought of as a Taylor expansion out to second order. The rules can be understood in the framework of Riemann sums. An integral $dB$ in sum form looks like:

$$\sum f_i (B_{i+1}-B_i).$$

An integral $dt$ in sum form looks like

$$\sum f_i (t_{i+1}-t_i).$$

An integral $(dB)^2$ in sum form looks like

$$\sum f_i (B_{i+1}-B_i)^2.$$

An integral $dt dB$ in sum form looks like

$$\sum f_i (t_{i+1}-t_i)(B_{i+1}-B_i).$$

Finally an integral $(dt)^2$ in sum form looks like

$$\sum f_i (t_{i+1}-t_i)^2.$$

When you derive Ito's formula, you prove two things. One is fairly trivial: the fourth and fifth types of sums converge to zero as you refine the mesh. This is because, if the time step size is $h$, then there are $O(1/h)$ terms in the sum, while the summands are $o(h)$ (of order $h^{3/2}$ for the fourth type, since $B_{i+1}-B_i$ is typically of size $\sqrt{h}$, and of order $h^2$ for the fifth). The other is quite nontrivial: the third type of sum converges, not to zero, but to an integral of the second type.
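These convergence statements are easy to check numerically. Below is a Monte Carlo sketch using NumPy (the choice $f_i = t_i$, the seed, and the step counts are mine, not from the argument above); with $f_i = t_i$ the second-type integral is $\int_0^1 t\,dt = 1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Take f_i = t_i on [0, T], so the "second type" integral is int_0^T t dt = T^2/2.
T = 1.0
for n in [100, 10_000, 1_000_000]:
    h = T / n
    t = h * np.arange(n)                       # left endpoints t_i
    dB = rng.normal(0.0, np.sqrt(h), size=n)   # Brownian increments B_{i+1} - B_i
    dt = np.full(n, h)                         # time increments t_{i+1} - t_i

    print(np.sum(t * dB**2),    # third type:  -> int_0^T t dt = 0.5
          np.sum(t * dt * dB),  # fourth type: -> 0
          np.sum(t * dt**2))    # fifth type:  -> 0
```

As the mesh is refined, the first column settles near $0.5$ while the other two shrink toward $0$.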

We rephrase these two observations heuristically by saying that $(dB)^2=dt$ and anything higher order than $dt$ is zero.

Incidentally, in *Radically Elementary Probability Theory* (available as a PDF from the author's web site), Nelson shows that one can actually make statements like $dB=\pm \sqrt{dt}$ rigorous within the framework of hyperreal-style nonstandard analysis. In standard-analysis language, this amounts to saying that if $D_k$ are iid random variables equally likely to be $+1$ or $-1$, then

$$B(t) = \lim_{h \to 0^+} \sum_{k=1}^{\left \lfloor \frac{t}{h} \right \rfloor} \sqrt{h} D_k.$$
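This limit can be illustrated by simulation (a NumPy sketch; the path count, step sizes, and seed are arbitrary choices of mine): at time $t=1$ the scaled walk should have mean near $0$ and variance near $t$, as Brownian motion does.

```python
import numpy as np

rng = np.random.default_rng(1)

# Approximate B(t) at t = 1 by the scaled random walk sum_{k=1}^{t/h} sqrt(h) D_k.
t = 1.0
n_paths = 10_000
for k in [100, 1000]:                              # number of steps, so h = t/k
    h = t / k
    D = rng.choice([-1.0, 1.0], size=(n_paths, k)) # iid signs D_k
    B_t = np.sqrt(h) * D.sum(axis=1)               # one sample of B(t) per path
    print(h, B_t.mean(), B_t.var())                # mean -> 0, variance -> t = 1
```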

Solution 2:

A simplified way of thinking about this is to interpret the equation as describing, approximately, what happens to the value on the left-hand side when time advances a little.

If the present time is $t$ and a very short interval of time $\Delta t$ passes, then the value of $f(t,X(t))$ will change approximately by the amount

$$ \dfrac{\partial f(t,X(t))}{\partial t}\Delta t + \dfrac{\partial f(t,X(t))}{\partial x}\Delta X(t) +\dfrac{1}{2}\dfrac{\partial^2 f(t,X(t))}{\partial x^2}(\Delta X(t))^2. $$

Note that this approximation involves approximating $\Delta X(t)$, i.e. how the process $X$ changes between $t$ and $t+\Delta t$. This depends on how the process $X$ is defined.

If the process $X$ is defined in terms of time and a Brownian motion, for example

$$\mathrm dX(t) = \mathrm dt + \sigma\, \mathrm dB(t),$$

which upon integrating from $0$ to $t$ (with $B(0)=0$) gives $X(t) - X(0) = t + \sigma B(t)$,

then $\Delta X(t)$ will depend on $\Delta t$ and $\Delta B(t)$, so you will encounter products of the form $\Delta t\, \Delta B(t)$, $\Delta t^2$, and $\Delta B(t)^2$ when computing $\Delta X(t)^2$.

The expression $\Delta t \Delta B = 0$ tells you to discard any terms containing $\Delta t \Delta B$ in your calculation of $\Delta X(t)^2$. Similarly, the other relations tell you how to get rid of the other "products of differentials".
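To see the discarding rules in action, one can expand $\Delta X(t)^2$ term by term for the example above (a NumPy sketch; the value $\sigma = 0.7$, the step count, and the seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)

# X from dX = dt + sigma dB, discretized with step h over [0, T].
T, sigma = 1.0, 0.7
n = 1_000_000
h = T / n
dB = rng.normal(0.0, np.sqrt(h), size=n)   # Brownian increments
dX = h + sigma * dB                        # increments of X

# (dX)^2 = (dt)^2 + 2 sigma dt dB + sigma^2 (dB)^2, summed term by term:
print(n * h**2)                     # sum of (dt)^2 terms: negligible
print(2 * sigma * h * np.sum(dB))   # sum of dt dB terms: negligible
print(sigma**2 * np.sum(dB**2))     # sum of (dB)^2 terms: close to sigma^2 * T
print(np.sum(dX**2))                # total: also close to sigma^2 * T = 0.49
```

Only the $(dB)^2$ column survives the limit, which is exactly what $(dB)^2 = dt$ encodes.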

The mathematical justification of this is that in the limit $\Delta t \to 0$ (which you take when integrating), those products behave just like their simpler alternative forms. That is, $\Delta t^2$ and $\Delta t \Delta B$ become negligible and, surprisingly, $\Delta B^2$ behaves just like $\Delta t$.
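As a final check, for $f(x) = x^2$ Ito's formula reads $B(T)^2 = 2\int_0^T B\,\mathrm dB + T$, precisely because the $\Delta B^2$ terms sum up to $T$. A NumPy sketch (seed and step count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)

T = 1.0
n = 1_000_000
h = T / n
dB = rng.normal(0.0, np.sqrt(h), size=n)
B = np.concatenate(([0.0], np.cumsum(dB)))   # B_0 = 0, ..., B_n = B(T)

ito = np.sum(B[:-1] * dB)   # left-endpoint sum: the Ito integral of B dB
qv = np.sum(dB**2)          # sum of Delta B^2: close to T

# Telescoping identity: B(T)^2 = 2 * sum B_i dB_i + sum (dB_i)^2, exactly.
print(B[-1]**2, 2 * ito + qv)
# Since qv is close to T, this is Ito's formula B(T)^2 = 2 * int B dB + T.
print(qv)
```

The telescoping identity holds exactly for every discretization; the only analytic content is that the quadratic-variation sum concentrates around $T$.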