Why is it considered that $(\mathrm d x)^2=0$?

Why is it okay to consider that $(\mathrm d x)^n=0$ for any $n$ greater than $1$? I can understand that $\mathrm d x$ is infinitesimally small (but greater than $0$), and hence its square or cube should be approximately equal to $0$, not exactly $0$.

But if this is so, then how can we expect the results obtained from calculus (like the slope of, or the area under, a curve) to be exact and not just approximate?

I have also noticed some apparent anomalies: for example, $\sqrt{(\mathrm d x)^2 + (\mathrm d y)^2}$ is $0$, but $\mathrm d x\sqrt{1+ (\mathrm d y/\mathrm d x)^2}$ is not $0$, even though these two expressions are apparently the same. Moreover, we can claim that

$$(\mathrm d x)^2=(\mathrm d x)^3=(\mathrm d x)^4 = \cdots = 0$$

which is quite hard to believe.

Can you help me figure out the logic behind these things?


It is not the square of $\mathrm{d}x$ that is $0$. It is $\mathrm{d}x\land\mathrm{d}x$ that is zero.

This comes into play with differential forms and changes of variables. Suppose that $u=x+y$ and $v=x-y$. Then
$$
\begin{align}
\iint f\,\mathrm{d}u\,\mathrm{d}v
&=\iint f\,\mathrm{d}(x+y)\,\mathrm{d}(x-y)\\
&=\iint f\,\mathrm{d}x\,\mathrm{d}x-\iint f\,\mathrm{d}x\,\mathrm{d}y+\iint f\,\mathrm{d}y\,\mathrm{d}x-\iint f\,\mathrm{d}y\,\mathrm{d}y\\
&=\iint f\,\mathrm{d}y\,\mathrm{d}x-\iint f\,\mathrm{d}x\,\mathrm{d}y\tag{1}
\end{align}
$$
Why do $\iint f\,\mathrm{d}x\,\mathrm{d}x=\iint f\,\mathrm{d}y\,\mathrm{d}y=0$? Well, inside the outer integral, $x$ is supposed to be held constant, so the inner $\mathrm{d}x$ will vanish. The same goes for the double $y$ integral.

Another consequence of this follows from
$$
\begin{align}
0
&=\iint f\,\mathrm{d}u\,\mathrm{d}u\\
&=\iint f\,\mathrm{d}(x+y)\,\mathrm{d}(x+y)\\
&=\iint f\,\mathrm{d}x\,\mathrm{d}x+\iint f\,\mathrm{d}x\,\mathrm{d}y+\iint f\,\mathrm{d}y\,\mathrm{d}x+\iint f\,\mathrm{d}y\,\mathrm{d}y\\
&=\iint f\,\mathrm{d}x\,\mathrm{d}y+\iint f\,\mathrm{d}y\,\mathrm{d}x\tag{2}
\end{align}
$$
That is, $\mathrm{d}y\land\mathrm{d}x=-\mathrm{d}x\land\mathrm{d}y$. Thus, the integral in $(1)$ is equal to
$$
-2\iint f\,\mathrm{d}x\,\mathrm{d}y\tag{3}
$$
which matches the Jacobian determinant $\frac{\partial(u,v)}{\partial(x,y)}=-2$.
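As a quick sanity check (my addition, using SymPy, and not part of the argument above): the substitution $u=x+y$, $v=x-y$ has Jacobian determinant $-2$, the same factor produced by the wedge-product computation.

```python
# Verify the Jacobian determinant of u = x + y, v = x - y with SymPy.
from sympy import symbols, Matrix

x, y = symbols('x y')
u, v = x + y, x - y

# Jacobian matrix of (u, v) with respect to (x, y)
J = Matrix([[u.diff(x), u.diff(y)],
            [v.diff(x), v.diff(y)]])

print(J.det())  # -2, so du dv = -2 dx dy as oriented area elements
```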


Note

It is not the case that $\sqrt{\mathrm{d}x^2+\mathrm{d}y^2}=0$. It is the same as $\mathrm{d}x\sqrt{1+\left(\frac{\mathrm{d}y}{\mathrm{d}x}\right)^2}$ in whatever context they both make sense.


Classical authors like Pierre de Fermat and Gottfried Wilhelm Leibniz discarded higher-order terms in an infinitesimal $E$ (in the case of Fermat) or $dx$ (in the case of Leibniz), while fully understanding that the terms are not being set to zero but rather discarded. In other words, they used a generalized relation of equality, up to a negligible term.

Fermat specifically introduced a term that is translated into English as adequality to refer to such a more general relation. Leibniz is quite specific in his writing (for example in his published response to Nieuwentijt in 1695) that he is working with such a generalized relation of equality.

In modern infinitesimal theories, this type of relation is formalized in terms of what is known as the standard part function (or shadow). Thus, the calculation of the ratio $\frac{\Delta y}{\Delta x}$ for $y=x^2$ will yield not the expected $2x$ but rather the infinitely close quantity $2x+\Delta x$ where $\Delta x$ is infinitesimal. To calculate the derivative at a real point $x=c$ one takes the standard part of $2c+\Delta x$ to obtain $2c$, the expected answer.

Thus when expanding the expression $(x+dx)^2=x^2+2x\,dx+dx^2$, one does not set the term $dx^2$ equal to zero, even though superficially it may seem that one is doing just that. One has to see the broader picture, in which these expressions are set in relation to one another, to understand what is going on.
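Here is a small SymPy sketch of this bookkeeping (my illustration, not Fermat's or Leibniz's own procedure): the ratio simplifies to $2x+\Delta x$, and discarding the leftover infinitesimal term, which mimics taking the standard part, leaves $2x$.

```python
# Form the difference quotient for y = x^2 symbolically, then discard
# the leftover infinitesimal term to mimic the standard part function.
from sympy import symbols, cancel

x, dx = symbols('x dx')

ratio = cancel(((x + dx)**2 - x**2) / dx)
print(ratio)              # 2*x + dx: infinitely close to 2x, not equal
print(ratio.subs(dx, 0))  # 2*x: the "shadow" (standard part)
```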



In standard analysis there are no infinitesimals. $dx$ is merely a piece of syntax used in expressions like $\frac{df}{dx}$ and $\int f(x)\, dx$, and nothing more. Instead, everything gets defined in terms of bounds on real numbers: in particular, limits are defined this way, which gets you derivatives and integrals. In this setting, a situation where you would see $dx^2$ if you were using infinitesimals might be differentiating $x^2$. In this case you find $\frac{(x+h)^2-x^2}{h}=2x+h$. This $h$ term is not zero... but if $x$ is not zero and $h$ is going to zero, then it is much smaller than the $2x$ to which it is being added. That is, the leading order term of $(x+h)^2$ is $x^2$; the first order correction is $2xh$.

Much of calculus is purely concerned with leading order terms and first order corrections. Much of the rest of it confines attention to second order corrections. Despite this, if you had $h^k$ by itself for some large integer $k$, you would not think of it as actually being zero; you only neglect it when it is being added to something much larger than itself. Thus in the infinitesimal language you shouldn't really think of $dx^2$ as being zero, but rather so much smaller than $dx$ that $dx+dx^2$ can be treated like $dx$. (In particular, under normal circumstances $\sqrt{(dx)^2+(dy)^2}$ can be interpreted as $dx \sqrt{1+(dy/dx)^2}$.)

This infinitesimal language can be formalized, resulting in theories which are referred to as nonstandard analysis. There are basically two ways to do this. One is smooth infinitesimal analysis which actually uses nilpotent infinitesimals, i.e. "nonzero" numbers with some power of them being zero. For instance for a nilsquare infinitesimal $dx$ you have $f(x+dx)=f(x)+f'(x)dx$ as an exact equality in SIA.
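A concrete computational model of nilsquare infinitesimals is the dual numbers $a + b\,\varepsilon$ with $\varepsilon^2 = 0$. The sketch below is my construction (a classical model, not a formalization of SIA itself, which genuinely needs intuitionistic logic), but it shows the exact identity $f(x+\varepsilon)=f(x)+f'(x)\,\varepsilon$ in action.

```python
# Dual numbers: pairs a + b*eps with eps^2 = 0 (nilsquare infinitesimal).
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b  # represents a + b*eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps,
        # because the eps^2 term is exactly zero.
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

    __rmul__ = __mul__

def f(x):
    return x * x + 3 * x  # f'(x) = 2x + 3

# f(x + eps) = f(x) + f'(x)*eps holds exactly, no limit needed:
result = f(Dual(2.0, 1.0))
print(result.a, result.b)  # 10.0 7.0, i.e. f(2) = 10 and f'(2) = 7
```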

SIA is a somewhat foreign theory, for at least two reasons. First, some finesse with logic is required to make it work without contradictions. You can't define SIA in classical logic; it is an inconsistent theory there, because (as you hinted at) one can use the law of excluded middle and the field axioms to prove that $(dx)^2=0$ implies $dx=0$. Intuitionistic logic dodges this issue. Second, SIA, as the name suggests, describes a "smooth universe": all the functions in it are infinitely differentiable. Standard analysis deals with less regular functions quite routinely.

The other main way to formalize infinitesimals is hyperreal analysis, which is suited to describe exactly the same things as standard analysis, in a certain precise and very strong sense. Hyperreal analysis has infinitesimals, but they are not nilpotent. Instead, hyperreal analysis replaces the limits of standard analysis with a "standard part" operation, which takes a number with an ordinary real part and an infinitesimal part and "discards" the infinitesimal part.

I only mention these so that you know that there is some power beyond just intuition in the use of infinitesimals. Nevertheless I would strongly encourage you to learn the meaning of everything in the standard framework.

Revising based on the bounty commentary: first of all, one should not view $\sqrt{dx^2+dy^2}$ (intuitively the length of an infinitesimal line segment) as being zero. It is exactly the same as $|dx| \sqrt{1+(dy/dx)^2}$. (We might need the absolute value because $x$ might rise or fall along the path.) A more general way to handle this would be to parametrize the curve in terms of an additional variable $t$, so that $\sqrt{dx^2+dy^2}=dt \sqrt{(dx/dt)^2 + (dy/dt)^2}$. Now $t$ only goes up (by our choice) so no absolute value is required.
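As a numerical illustration of this point (my example, not from the original discussion): summing $\sqrt{dx^2+dy^2}$ and $dx\sqrt{1+(dy/dx)^2}$ over small steps along $y=x^2$ gives the same arc length, and certainly not zero.

```python
# Approximate the length of y = x^2 on [0, 1] with small steps, once as
# sqrt(dx^2 + dy^2) and once as dx * sqrt(1 + (dy/dx)^2).
import math

n = 100_000
h = 1.0 / n
length1 = length2 = 0.0
for i in range(n):
    x0, x1 = i * h, (i + 1) * h
    dx, dy = x1 - x0, x1**2 - x0**2
    length1 += math.sqrt(dx**2 + dy**2)
    length2 += dx * math.sqrt(1 + (dy / dx)**2)

print(length1, length2)  # both ~1.4789..., and equal, not 0
```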

As for writing $dx+dx^2 \approx dx$, it really depends on the context. With derivatives, the whole point is not to exactly write down the function, it's all about linear approximation. Thus for instance when I write $(x+h)^2 \approx x^2+2xh$, I am doing that because I don't want to pay attention to terms of higher order than $h$, because those first two terms (the largest ones, if $h$ is small enough) are enough for whatever purpose I have.

On the other hand, a basic philosophy in calculus and (standard) analysis is that one can prove that two things are equal by proving that they are arbitrarily close together. So to follow your example, when you expand out a proof that $\int_0^\pi \sin(x) dx = 2$, you might show that there is a lower sum for $\int_0^\pi \sin(x) dx$ which is at least $2-\epsilon$ and an upper sum which is at most $2+\epsilon$, for each $\epsilon>0$. The partition depends on $\epsilon$, and that dependence is exactly where the "limit" operation is hidden. (In practice we don't do this, we just use the FTC, but the FTC is proven in this fashion.)
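Here is a rough numerical sketch of that squeeze (my illustration; on each subinterval $\sin$ is monotone on either side of $\pi/2$, so its minimum sits at an endpoint and its maximum at an endpoint or at $\pi/2$): the lower and upper sums both approach $2$ as the partition is refined.

```python
# Lower and upper Riemann sums for the integral of sin on [0, pi].
import math

def lower_upper(n):
    h = math.pi / n
    lo = up = 0.0
    for i in range(n):
        a, b = i * h, (i + 1) * h
        end_vals = (math.sin(a), math.sin(b))
        # Minimum of sin on [a, b] is at an endpoint; the maximum is
        # at an endpoint unless the subinterval contains pi/2.
        lo += min(end_vals) * h
        up += (1.0 if a <= math.pi / 2 <= b else max(end_vals)) * h
    return lo, up

for n in (10, 100, 1000):
    lo, up = lower_upper(n)
    print(n, lo, up)  # lower sums rise toward 2, upper sums fall toward 2
```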


This post is meant as an extended comment rather than an answer.

Ian points out that there are three ways to interpret the symbol "$dx$":

  1. smooth infinitesimal analysis;
  2. hyperreal analysis;
  3. differential forms.

The first two approaches are somewhat less standard, I think, and indeed, I know very little about either. As such, I'd like to comment on the question from the perspective of (3) differential forms.


In the theory of differential forms, the following five objects should be distinguished: $$dx, \ \ d(x^2), \ \ (dx)^2, \ \ \ dx \wedge dx, \ \ d(dx).$$

  • The object $dx$ is a "differential $1$-form." It is not zero.
  • The object $d(x^2)$ is equal to $2x\,dx$, which is also a "differential $1$-form." It is also not zero.
  • The object $(dx)^2$ is a "smooth quadratic form." It is not zero. Here, the squaring is an operation called "symmetric product."
  • The object $dx \wedge dx$ is a "differential $2$-form." It is equal to zero. The $\wedge$ symbol is an operation called "wedge product." The wedge product has the funny property that $dx \wedge dx = 0$, whereas $dx \wedge dy = -dy \wedge dx$ is not zero.
  • The object $d(dx)$ is a "differential $2$-form." It is equal to zero. In fact, the symbol $d$ is called the "exterior derivative," and has the funny property that $d(df) = 0$ for any function $f$.

While this does not answer the question per se, I hope this clarification will be useful for understanding what is going on.
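For the curious, here is a bare-hands sketch (my construction, assuming constant $1$-forms on $\mathbb{R}^2$) of why $dx \wedge dx = 0$ while $dx \wedge dy = -\,dy \wedge dx$: the wedge of two $1$-forms, evaluated on a pair of vectors, is a $2\times 2$ determinant, and a determinant with two equal rows vanishes.

```python
# A constant 1-form on R^2 is a linear functional on vectors; the wedge
# product of two 1-forms, evaluated on two vectors, is a determinant.
def wedge(alpha, beta):
    """Return the 2-form (alpha ^ beta) as a function of two vectors."""
    def two_form(u, v):
        return alpha(u) * beta(v) - alpha(v) * beta(u)
    return two_form

dx = lambda vec: vec[0]  # dx picks out the first component
dy = lambda vec: vec[1]  # dy picks out the second component

e_x, e_y = (1, 0), (0, 1)
print(wedge(dx, dx)(e_x, e_y))  # 0: dx ^ dx vanishes identically
print(wedge(dx, dy)(e_x, e_y))  # 1
print(wedge(dy, dx)(e_x, e_y))  # -1: dy ^ dx = -(dx ^ dy)
```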


Let us try to understand stuff at the intuitive level with the help of a toy problem. If you are looking for advanced mathematics, please skip this answer.

Suppose a class of children is going from a place A to B. At the beginning of the journey, the teacher says, "Hi class! There is a bit of a problem. The speedometer of the bus is not working, but we will need to calculate the speed of the bus at various times. Can we do it? I can tell you that for the next few seconds, the distance travelled by the bus is $x = t^2$, where $x$ is in meters and $t$ is in seconds. Specifically, I want you to find the speed at $t=2$ and $t=3$ seconds."

The class, which has no concept of calculus, is puzzled at first. But slowly, the students try to figure out some approximations.

Siddhartha: If we want to find the speed at $t=2$ seconds, we can look at the distance travelled between $t=1$ and $t=2$. That would be 3 meters, so we can say that the speed at $t=2$ is greater than 3 m/s.

Akanksha: Good point. But instead of the previous one second, we can look at the next one second. In the next second, the bus travels 5 m. So, the speed is less than 5 m/s. In fact, we can say that the speed at $t=2$ seconds is between 3 m/s and 5 m/s.

Harsh: But why are we taking the time gap to be 1 second? If we reduce the time gap, we will get a better approximation, no?

Siddhartha: Lovely! Let's do it with a time gap of 1/2 second. (Starts putting numbers on paper and doing some addition and subtraction.) Wow, so with a time gap of 1/2 second, we can say that our speed is between 3.5 and 4.5 m/s.

Akanksha: And we can repeat this process for smaller time gaps as well. In fact, I have a feeling that if we take the time gap to be 1/4 second, we will get a speed between 3.75 and 4.25 m/s.

Teacher: Why don't you check that?

After a few seconds, Harsh verifies the claim. At this point, the teacher asks them to find a proof that this holds for general $t$ and $\Delta t$.

So, the students do the calculation $$v = \frac{(t + \Delta t)^2 - t^2}{\Delta t} = \frac{2t\,\Delta t + (\Delta t)^2}{\Delta t} = 2t + \Delta t.$$

So, if we take the time gap to be $\Delta t$, we can say that our velocity lies between $2t - \Delta t$ and $2t + \Delta t$. So, if we make our $\Delta t$ very small (approximately zero), we get our velocity as $2t$. We can call this our velocity right now.

Teacher: Excellent! The technical term for this is instantaneous velocity. Can you repeat the same procedure if I give you $x = t^3$ instead?

Students (all excited): Yes sure!

$$v = \frac{(t + \Delta t)^3 - t^3}{\Delta t} = \frac{3t^2\,\Delta t + 3t(\Delta t)^2 + (\Delta t)^3}{\Delta t} = 3t^2 + 3t\,\Delta t + (\Delta t)^2$$

Siddhartha: Teacher, I am getting this expression. What should I do now?

Teacher: Try $t=2$ and see what happens.

Siddhartha: If I put $\Delta t$ to be very small, say 0.0001, I get values very close to 12.

Teacher: Lovely. What about $t=3$? General $t$?

Siddhartha: I can always put $\Delta t$ to be very very small. So, the only term which remains is $3t^2$.

Teacher (after waiting for others to catch up): Excellent! Now, do you notice that, in effect, when we are expanding $(t + \Delta t)^n$, we can for our purposes ignore all powers of $\Delta t$ greater than $1$? So, we could have expanded $(t + \Delta t)^2$ as $t^2 + 2t\,\Delta t$ and $(t + \Delta t)^3$ as $t^3 + 3t^2\,\Delta t$ and still got the same answer.

The students fall silent. After some time, one of them breaks the silence.

Akanksha: It is because, in the division, we have the power of $\Delta t$ as $1$. So, any terms of higher power become very small when we make $\Delta t$ small. In fact, if we take $\Delta t$ to be almost zero, the higher powers are all almost zero: if $\Delta t = 0.0001$, its higher powers are even smaller, in fact, much smaller.

Teacher: Excellent thinking Akanksha. In fact, all of you have done a great job. You have figured out the basics of calculus by yourself. Let me just fill in some nomenclature so that we can share with others our line of thought.

When we say that $\Delta t$ is almost zero, we write $\lim_{\Delta t \rightarrow 0}$. Since this is used many, many times, we save a lot of effort just by writing $dt$ instead of writing $\Delta t$ under $\lim_{\Delta t \rightarrow 0}$.

So, when the denominator has $dt$ to the first power, we can safely set $dt^2, dt^3, \ldots$ to $0$. However, if the denominator has a higher power of $dt$, then we obviously cannot do this.

Can you understand this, my dear students?

Harsh: So, you are saying that we can ignore all powers of $dt$ higher than the lowest power in the denominator.

Teacher: Yes.

Harsh: Would it also hold for non-integral powers?

Teacher: You say?

Harsh: It should, since $0.0001^{3/2}$ is still smaller than 0.0001, which we are taking to be almost zero.

Teacher: Lovely!
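(For readers who want to replay the classroom computation themselves, here is a short numerical check, my addition: the difference quotients for $x=t^2$ and $x=t^3$ at $t=2$ approach $4$ and $12$ as the time gap shrinks.)

```python
# Difference quotients for x = t^2 and x = t^3 at t = 2, for a
# sequence of shrinking time gaps dt.
for dt in (1.0, 0.5, 0.25, 0.01, 0.0001):
    v2 = ((2 + dt)**2 - 2**2) / dt  # equals 4 + dt
    v3 = ((2 + dt)**3 - 2**3) / dt  # equals 12 + 6*dt + dt^2
    print(dt, v2, v3)
```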


In our case, we can't say that $\sqrt{(dx)^2 + (dy)^2} = 0$ without more context. Specifically, the context required is whether or not we can ignore an infinitesimal change in $x$. Nor can we say that $dx \sqrt{1 + (dy/dx)^2}$ is not zero, for the same reason.

In fact, the two ($dx \sqrt{1 + (dy/dx)^2}$ and $\sqrt{(dx)^2 + (dy)^2}$) are identical. If one is zero, the other has to be.

What we can say is that $\sqrt{1 + (dy/dx)^2}$ is non-zero, because it is the square root of $1$ plus a square, hence the square root of something that is at least $1$.