Why aren't integration and differentiation inverses of each other?

Integration is supposed to be the inverse of differentiation, but the integral of the derivative is not equal to the derivative of the integral:

$$\dfrac{\mathrm{d}}{\mathrm{d}x}\left(\int f(x)\mathrm{d}x\right) = f(x) \neq \int\left(\dfrac{\mathrm{d}}{\mathrm{d}x}f(x)\right)\mathrm{d}x$$

For instance: $$\begin{align*} &\dfrac{\mathrm{d}}{\mathrm{d}x}\left(\int 2x+1\;\mathrm{d}x\right) &&= \dfrac{\mathrm{d}}{\mathrm{d}x}\left(x^2+x+C\right) &= 2x+1\\ &\int\left(\dfrac{\mathrm{d}}{\mathrm{d}x}\left(2x+1\right)\right)\mathrm{d}x &&= \int 2\;\mathrm{d}x &= 2x+C\end{align*}$$

Why isn't it defined such that $\dfrac{\mathrm{d}}{\mathrm{d}x}a = \dfrac{\mathrm{d}a}{\mathrm{d}x}$ (kept as a formal symbol rather than evaluated to $0$), where $a$ is a constant, and $\int f(x)\;\mathrm{d}x = F(x)$? Then we would have: $$\begin{align*} &\dfrac{\mathrm{d}}{\mathrm{d}x}\left(\int 2x+1\;\mathrm{d}x\right) &&= \dfrac{\mathrm{d}}{\mathrm{d}x}\left(x^2+x\right) &= 2x+1\\ &\int\left(\dfrac{\mathrm{d}}{\mathrm{d}x}\left(2x+1\right)\right)\mathrm{d}x &&= \int \left(2+\dfrac{\mathrm{d}1}{\mathrm{d}x}\right)\;\mathrm{d}x &= 2x+1\end{align*}$$

So in general we would have:

$$\dfrac{\mathrm{d}}{\mathrm{d}x}\left(\int f(x)\mathrm{d}x\right) = f(x) = \int\left(\dfrac{\mathrm{d}}{\mathrm{d}x}f(x)\right)\mathrm{d}x$$

So what is wrong with my thinking, and why isn't this the used definition?


The downvotes on this question aren't really fair. Sure, on a technical level it's a simple question to answer, but understanding exactly why derivatives and integrals aren't perfect inverses (and how to deal with that) is hugely important for everything from solving differential equations to constructing topological invariants on manifolds. Furthermore, it's common for students to think invertible operations are always preferable since they preserve all the information, so it's useful to present examples which show that noninvertibility and information loss are not only acceptable, but in many cases desirable. Some of this may go over your head initially (I tried to keep the level accessible to undergraduate students, but in the end I had to assume a bit of background); even so, I hope you'll at least get some idea of what I'm saying and possibly take up studying some of these things in more depth if you're interested in them.

At the most fundamental level, it's easy to see exactly what fails. Suppose we take two arbitrary constant functions $f_1: x \mapsto c_1$ and $f_2 : x \mapsto c_2$. Then $\frac{d f_1}{dx} = \frac{d f_2}{dx} = 0$. When you define an antiderivative $\int$, you'll have to pick a particular (constant) function $\int 0 \, dx = c$. Assuming $c_1 \ne c_2$, we can't simultaneously satisfy both $\int \frac{d f_1}{dx} dx = c_1$ and $\int\frac{d f_2}{dx}dx=c_2$, since $c$ can't equal both $c_1$ and $c_2$. This comes directly from the definition of the derivative as $$\frac{df}{dx}\Big|_{x = x_0} = \lim_{x \rightarrow x_0} \frac{f(x) - f(x_0)}{x-x_0},$$ and so there's no helping it so long as we keep this as our definition of the derivative and allow arbitrary functions.
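If you want to see this failure on a computer, here is a small check with Python's sympy library (purely an illustration, not part of the argument; sympy's `integrate` returns one particular antiderivative with no "$+C$", which is exactly the kind of arbitrary choice described above):

```python
import sympy as sp

x = sp.symbols('x')
f = 2*x + 1

# d/dx of an antiderivative recovers f ...
print(sp.diff(sp.integrate(f, x), x))   # 2*x + 1

# ... but antidifferentiating the derivative loses the constant term
print(sp.integrate(sp.diff(f, x), x))   # 2*x

# two different constant functions have the same derivative,
# so no single antiderivative can send 0 back to both of them
c1, c2 = sp.Integer(3), sp.Integer(7)
print(sp.diff(c1, x), sp.diff(c2, x))   # 0 0
```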

Furthermore, this is in some sense the "only" way this can fail. Suppose we have two arbitrary smooth functions $g_1$ and $g_2$ such that $\frac{dg_1}{dx} = \frac{dg_2}{dx}$. Then $\frac{d}{dx}(g_1-g_2)=0$, and so $g_1(x) - g_2(x) = c$ is a constant function. In other words, the only way for two smooth functions on $\mathbb R$ to have the same derivative is for them to differ by a constant. (Throughout, we restrict to functions which are infinitely differentiable, i.e. smooth, so that we can keep applying the derivative as many times as we like.)

If you've studied linear algebra, you'll realize what we're doing is actually familiar in that context. (If not, while this answer could probably be written without making extensive use of linear algebra, it would necessarily be far longer, and it's probably already too long, so I apologize in advance that you'll probably have a hard time reading some of it.) Rather than thinking of functions as individual objects, it's useful to collect all smooth functions together into a single object. This is called $C^\infty(\mathbb R)$, the space of smooth functions on $\mathbb R$. On this space, we have point-wise addition and scalar multiplication, so it forms a vector space. The derivative can be thought of as a linear operator on this space $D: C^{\infty}(\mathbb R) \rightarrow C^\infty(\mathbb R), f \mapsto \frac{df}{dx}$. In this language, what we've shown is that $\ker D$ (the kernel of $D$, that is, everything which $D$ sends to $0$) is exactly the set of constant functions. A general result from linear algebra is that a linear map is one-to-one if and only if its kernel is $0$.
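To make the linear-algebra picture concrete, here is a finite-dimensional toy model in Python/numpy (my own illustration, not a standard construction): the derivative acting on polynomials of degree at most $4$, written as a matrix in the monomial basis, where "the kernel is the constants" becomes a simple rank computation.

```python
import numpy as np

# columns index the basis 1, x, x^2, x^3, x^4;
# the column for x^k is the coefficient vector of its derivative k*x^(k-1)
n = 4
D = np.zeros((n + 1, n + 1))
for k in range(1, n + 1):
    D[k - 1, k] = k

rank = np.linalg.matrix_rank(D)
print("dim ker D =", (n + 1) - rank)     # 1  (the constant polynomials)
print("D invertible?", rank == n + 1)    # False
```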

If we want to make $D$ invertible (so that we could define an anti-derivative $\int$ which is a literal inverse to $D$), we have a few options. I have to warn you that the terminology and notation here are nonstandard (there is no standard for most of it), so you'll need to use caution comparing to other sources.


Option 1: Try to add more "functions" to $C^\infty (\mathbb R)$ which will make $D$ invertible.

Here's one approach you can use. Define a larger space $\bar C^\infty(\mathbb R) \supset C^\infty(\mathbb R)$ by allowing each function to have an arbitrary finite formal sum of the form $a_1 \zeta_1 + a_2 \zeta_2 + a_3 \zeta_3 + \cdots + a_n \zeta_n$ added to it, where the $\zeta_i$ are formal parameters. A generic element of $\bar C^\infty(\mathbb R)$ is something like $f(x) + a_1 \zeta_1 + a_2 \zeta_2 + a_3 \zeta_3 + \cdots + a_n \zeta_n$ for some $f \in C^\infty(\mathbb R)$, $n \ge 0$, and $a_1, \ldots, a_n \in \mathbb R$. The $\zeta_i$ will keep track of the information we'd normally lose by differentiating, so that differentiation becomes invertible.

We'll define a derivative $\bar D$ on $\bar C^\infty(\mathbb R)$ based on the derivative $D$ we already know for $C^\infty(\mathbb R)$. For any function $f \in C^\infty (\mathbb R)$, let $(\bar D f)(x) = \frac{df}{dx} + f(0) \zeta_1$. The choice of $f(0)$ here is somewhat arbitrary; you could replace $0$ with any other real number, or even do some things which are more complicated. Also define $\bar D(\zeta_i) = \zeta_{i+1}$. This defines $\bar D$ on a set which spans our space, and so we can linearly extend it to the whole space. It's easy to see that $\bar D$ is one-to-one and onto; hence, invertible. The inverse to $\bar D$ is then $$\int (F(x') + a_1 \zeta_1 + \cdots + a_n \zeta_n) dx' = \int_0^x F(x') dx' + a_1 + a_2 \zeta_1 + \cdots+ a_n \zeta_{n-1}.$$ This is a full inverse to $\bar D$ in the sense that ordinary anti-derivatives fail to be.
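Here is a rough sympy sketch of this construction, with an element $f + a_1\zeta_1 + \cdots + a_n\zeta_n$ stored as a pair `(f, [a_1, ..., a_n])`; the names `Dbar` and `Dbar_inv` are mine, and the choice of $0$ as the base point is the same arbitrary choice made above.

```python
import sympy as sp

x, t = sp.symbols('x t')

def Dbar(f, zetas):
    # Dbar(f) = f' + f(0)*zeta_1, and each zeta_i is shifted to zeta_{i+1}
    return sp.diff(f, x), [f.subs(x, 0)] + list(zetas)

def Dbar_inv(F, zetas):
    # the function part becomes a_1 + integral_0^x F, and the zetas shift down
    a1 = zetas[0] if zetas else sp.Integer(0)
    return a1 + sp.integrate(F.subs(x, t), (t, 0, x)), list(zetas[1:])

# round trip on f(x) = x^2 + 5: the constant 5 survives in the zeta part
f = x**2 + 5
print(Dbar(f, []))               # (2*x, [5])
print(Dbar_inv(*Dbar(f, [])))    # (x**2 + 5, [])
```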

This is, in fact, almost exactly what the OP was trying to do. He wanted to add formal variables like $\frac{da}{dx}$ so that if $a \ne b$ are constants, $\frac{da}{dx} \ne \frac{db}{dx}$. In my notation, "$\frac{da}{dx}$" is $a \zeta_1$. That alone doesn't save you, since you don't know how to differentiate $\zeta_1$, so you have to add another parameter $\zeta_2 = \bar D \zeta_1$, and to make $\bar D$ invertible you need an infinite tower of $\zeta_i$. You also need to handle what to do when neither $f_1$ nor $f_2$ is constant, but $f_1 - f_2$ is constant. But you can definitely handle all these things if you really want to.

The question here isn't really whether you can do it, but what it's good for. $\bar D$ isn't really a derivative anymore. It's a derivative plus extra information which remembers exactly what the derivative forgets. But derivatives aren't a purely abstract concept; they were created for applications (e.g. measuring the slope of curves), and in those applications you generally do want to get rid of this information. If you want to think of a function as something you can measure (e.g. in physics), the $\zeta_i$ are absolutely not measurable. Even in purely mathematical contexts, it's hard to see any immediate application of this concept. You could rewrite some of the literature in terms of your new lossless "derivative", but it's hard to see this rewriting adding any deep new insight. You shouldn't let that discourage you if you think it's an interesting thing to study, but for me, I have a hard time calling the object $\bar D$ a derivative in any sense. It isn't really anything like the slope of a curve, which doesn't care if you translate the curve vertically.

There is a useful construction hiding in this, though. Specifically, if you restrict the domain of $\bar D$ to $C^\infty(\mathbb R)$ and the range to its image (which is everything of the form $f + a \zeta_1$ for $f \in C^\infty(\mathbb R)$ and $a \in \mathbb R$), then $\bar D |_{C^{\infty}(\mathbb R)}$ is still an isomorphism, though now between two different vector spaces. Inverting $\bar D$ on this space is just solving an initial value problem:

\begin{align} \frac{dF}{dx} &= f(x) \\ F(0) &= a \end{align}

The solution is $F(x) = a + \int_0^x f(x')\, dx'$. You can do similar constructions for higher derivatives. Obviously, solving initial value problems is highly important, and the construction above is a natural generalization of this which allows arbitrarily many derivatives to be taken. However, even though $\bar D |_{C^\infty(\mathbb R)}$ is still a linear isomorphism, I have a hard time thinking of this as a true way to invert derivatives in the way the OP wants, since the domain and range are different. Hence, in my view this is philosophically more along the lines of Option 3 below, which is to say that it's a way to work around the noninvertibility of the derivative rather than a way to "fix" the derivative.


Option 2a: Get rid of problematic functions by removing them from $C^\infty (\mathbb R)$

You may say "rather than adding functions, let's get rid of the problematic ones which make this not work". Let's just define the antiderivative to be $\int_0^x f(x') dx'$, and restrict to the largest subspace $\hat C^\infty (\mathbb R)$ on which $D$ and $\int$ are inverses. The choice of $0$ for the lower limit is again arbitrary, but changing it doesn't change the story much.

If you do this, you can see that any function $f \in \hat C^\infty (\mathbb R)$ needs to have $f(0) = 0$, since $f(0) = \int_0^0 \frac{df}{dx} dx = 0$. But requiring $\frac{df}{dx} \in \hat C^\infty (\mathbb R)$ as well means that we must also have $\frac{df}{dx}|_{x=0} = 0$, and by induction $\frac{d^nf}{dx^n}|_{x=0} = 0$ for all $n$. One way of saying this is that the Maclaurin series of $f$ must be identically $0$. If you haven't taken a class in real analysis (which you should if you're deeply interested in these things), you may never have encountered a function, other than the zero function, all of whose derivatives vanish at $x=0$, but they do exist; one example is

\begin{equation*} f(x) = \left\{ \begin{array}{lr} e^{-1/x^2} & : x \ne 0\\ 0 & : x = 0. \end{array} \right. \end{equation*}
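If you want some reassurance that this really works, here's a quick sympy check (purely illustrative) that the first few derivatives of $e^{-1/x^2}$ all tend to $0$ as $x \to 0$, which is what the piecewise definition above relies on:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.exp(-1/x**2)

# limits of f and its first few derivatives as x -> 0 (from the right)
print([sp.limit(sp.diff(f, x, n), x, 0) for n in range(4)])   # [0, 0, 0, 0]
```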

There are many other examples, but I'll leave it at that; you can check the Wikipedia article on smooth, non-analytic functions for more information. On $\hat C^\infty (\mathbb R)$, the differential equation $D f = g$ for fixed $g \in \hat C^\infty (\mathbb R)$ has a unique solution given by $f(x) = \int_0^x g(x')\, dx'$. It's worth pointing out that the only polynomial function in $\hat C^\infty (\mathbb R)$ is the $0$ function.

This space looks bizarre and uninteresting at first glance, but it's actually not that divorced from things you should care a lot about. If we return to $C^\infty (\mathbb R)$, we'll typically want to solve differential equations on this space. A typical example might look something like $$a_n D^n f + a_{n-1} D^{n-1} f + \cdots + a_0 f = g$$ for some fixed $g$. As you will learn from studying differential equations (if you haven't already), one way to get a unique solution is to impose initial conditions on $f$ of the form $f(0) = c_0, (Df)(0) = c_1 , \ldots, (D^{n-1} f) (0) = c_{n-1}$. All we've done in creating this space is find a class of functions upon which we can uniformly impose a particular set of initial conditions.
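If you have sympy handy, `dsolve` shows the same picture (the particular equation is just one I picked for illustration): without initial conditions you get a family of solutions with free constants, and imposing conditions at $x = 0$ picks out a single function.

```python
import sympy as sp

x = sp.symbols('x')
f = sp.Function('f')
ode = sp.Eq(f(x).diff(x, 2) - f(x), sp.sin(x))

print(sp.dsolve(ode))   # general solution with free constants C1, C2
print(sp.dsolve(ode, ics={f(0): 1, f(x).diff(x).subs(x, 0): 0}))   # one solution
```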

This set of functions also has another nice property which you might not notice at first pass: you can multiply functions. The object we constructed in Option 1 didn't really have an intuitive notion of multiplication, but now, we can multiply point-wise and still end up in the same space. This follows from the general Leibniz rule. In fact, not only can we multiply any two functions in $\hat C^\infty(\mathbb R)$, but we can multiply a function in $\hat C^\infty(\mathbb R)$ by an arbitrary smooth function in $C^\infty(\mathbb R)$, and the result will still be in $\hat C^\infty(\mathbb R)$. In the language of algebra, $\hat C^\infty(\mathbb R)$ is an ideal of $C^\infty(\mathbb R)$, and ideals are well-studied and have lots of interesting and nice properties. We can also compose two functions in $\hat C^\infty(\mathbb R)$, and by the chain rule we'll still end up in $\hat C^\infty(\mathbb R)$.

Unfortunately, in most applications, the space we've constructed isn't sufficient for solving differential equations which arise, despite all its nice properties. The space is missing very many important functions (e.g. $e^x$), and so the right hand side $g$ above will often not satisfy the constraints we impose. So while we can still solve the differential equations, to do it, we need to go back to a space on which $D$ is not one-to-one. However, it's worth pointing out that there are variants of this construction. Rather than putting initial conditions at $x=0$, in many applications, the more natural thing is to put conditions on the asymptotic behavior as $x \rightarrow \infty$. There are interesting parallels to our construction above in this context, but unfortunately, they're a bit too advanced to include here, as they require significantly more than what a typical undergraduate would know. In any case, these constructions are useful, as you would guess, in the study of differential equations, and if you decide to pursue that in detail, eventually you'll see something which looks qualitatively like this.


Option 2b: Get rid of problematic functions by equating functions in $C^\infty (\mathbb R)$

Those who have studied linear algebra will remember that there are two different ways in which one vector space can be "smaller" than another: subspaces and quotient spaces. In 2a we constructed a subspace on which $D$ is one-to-one and onto, but we could just as well construct a quotient space with the same property.

Let's define an equivalence relation on $C^\infty(\mathbb R)$. We'll say $f_1 \sim f_2$ if and only if $f_1(x) - f_2(x)$ is a polynomial function in $x$, i.e. a function such that for some $n$, $D^n (f_1 - f_2) = 0$. Let $[f]$ be the set of all functions $g$ such that $g \sim f$ i.e. the equivalence class of $f$ under $\sim$. We'll call the set of all equivalence classes $\tilde C^\infty(\mathbb R)$. A generic element of $\tilde C^\infty(\mathbb R)$ is a collection $[f]$ of functions in $C^\infty(\mathbb R)$, all of which differ from each other only by polynomial functions.

On this space, we have a derivative operator $\tilde D$ given by $\tilde D [f] = [D f]$. You can check that this is well-defined; that is, if you pick a different representative $g$ of the class $[f]$, then $[D f] = [D g]$. This is just the fact that the derivative of a polynomial function is again a polynomial function. An important thing to realize is that if $f \in C^\infty(\mathbb R)$ is such that $[f] = 0$, then $f$ is a polynomial function.
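A quick sympy spot-check of the well-definedness (an illustration of the statement above, not a proof): two representatives of the same class differ by a polynomial, and their derivatives again differ by a polynomial.

```python
import sympy as sp

x = sp.symbols('x')
f = sp.exp(x)
g = sp.exp(x) + x**3 - 2*x + 7     # g ~ f, since g - f is a polynomial

d = sp.expand(sp.diff(g, x) - sp.diff(f, x))
print(d)                       # 3*x**2 - 2
print(d.is_polynomial(x))      # True, so [Df] = [Dg]
```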

In some sense, this is a strictly bigger space than $\hat C^\infty(\mathbb R)$, because it's easy to check that the map $\hat C^\infty(\mathbb R) \rightarrow \tilde C^\infty(\mathbb R)$ given by $f \mapsto [f]$ is one-to-one (since $\hat C^\infty(\mathbb R)$ has no nonzero polynomial functions). But we've now got access to other important functions, like $f(x) = e^x$, so long as we're content to think of them not as actual functions, but as equivalence classes. Of course, if we impose initial conditions on a function and all of its derivatives at a given point, we can reconstruct any function from its equivalence class, and thus exactly solve differential equations that way.

The expanded set of functions does come at a price: we've lost the ability to multiply functions. To see this, note that $[0] = [1]$ as constant functions, but if we multiply both by $e^x$, $[0] \ne [e^x]$. This flaw is unavoidable for us, and it does make it harder to study this space in any depth. While $\hat C^\infty(\mathbb R)$ could be studied using algebraic techniques, we're less able to say anything really interesting about $\tilde C^\infty(\mathbb R)$.

The biggest reason to use a space like this, though (for me at least), is for something like asymptotic analysis. If you only care about the behavior of a function for very large arguments, it often turns out not to matter very much (in a precise way) exactly how you pick integration constants. The space we've constructed here is really only good for solving differential equations of the form $D^n f = g$, but with some work you can extend it to other kinds of differential equations, and with some more work (i.e. quotienting out functions which are irrelevant asymptotically), you can get rid of other functions which you don't care about asymptotically.

Such spaces are often used (implicitly and typically without rigorous definition) as a first approximation in theoretical physics and in certain types of research on ordinary differential equations. You can get a very coarse view of what the solution to a differential equation looks like without caring too much about the details. I believe (and am far from an expert on this) that they also have some application in differential Galois theory, and while that would be an interesting direction to go for this answer, it's probably too advanced for someone learning this for the first time, and certainly outside my expertise.

Option 3: Realize that $D$ not being invertible isn't necessarily a bad thing

This is the direction most good mathematics takes. Just stick with $D$ and $C^\infty(\mathbb R)$, but realize that the non-invertibility of $D$ isn't necessarily a very bad thing. Sure, it means you can't solve differential equations without some initial/boundary conditions, which is to say that you don't have unique anti-derivatives, but let's think about what that buys you. An interesting trend in modern mathematics is realizing that by forgetting information, sometimes you can make other information more manifest and hence easy to work with. So let's see what the non-invertibility of $D$ can actually tell us.


Let's first look at an application, and since I'm a physicist by day, I'll pick physics. Newton's second law is perhaps the most fundamental law in classical mechanics. For a particle travelling on a line, it looks like $\frac{d^2}{dt^2} x(t) = f\!\left(\frac{d}{dt}x(t), x(t), t\right)$. (I've switched notation: $x$ is now the position, the function of interest, and $t$ is the independent variable.) In general this equation isn't very easy to solve, but we note that it involves at most two derivatives, so we expect that (morally speaking) we'll have to integrate twice and be left with 2 free parameters.

The two-dimensional space parametrizing all such solutions of a system is called "phase space", and as it turns out, this is a very fundamental concept for all of physics. We can take the parameters to be the position and velocity (or momentum) at the initial time. The fact that the full behavior of the system depends only on the initial position and velocity isn't emphasized enough in introductory courses, but it is really fundamental to the way that physicists think about physics, to the point that anything involving three derivatives is viewed as confusing and often unphysical. Jerk (i.e. the third derivative of position) isn't expected to play any role at a fundamental level, and when it does show up (e.g. in the Abraham–Lorentz force), it is a source of much confusion.

That isn't the end of the story, though. We know that the world is better described by quantum mechanics than classical mechanics at short distances. In quantum mechanics, we don't have a two-dimensional phase space. Rather, the space of states for a particle on a line consists of "wavefunctions" $\mathbb R \rightarrow \mathbb C$. The magnitude squared describes the probability distribution for measurements of the particle's position, while the complex phase of the function describes the distribution of momenta in a precise way. So rather than having 2 real dimensions, we really only have 1 complex one, and we use it to describe the distributions of both position and momentum. The Schrödinger equation (which is the quantum mechanical version of Newton's 2nd law) describes how the wavefunction evolves in time, but it involves only one time derivative. This process of cutting down the two-dimensional phase space to a 1-dimensional space (upon which you look at probability distributions) is a source of ongoing research among mathematicians seeking to understand the full extent of, and conditions under which, it can be performed; this program goes by the name of geometric quantization.


Let's move back to pure mathematics, to see how you can use the non-invertibility of the derivative in that context as well. We already know that the kernel of $D$ is just the set of constant functions. This is a 1-dimensional space; that is, $\dim \ker D = 1$. But what would happen if, instead of looking at smooth functions $\mathbb R \rightarrow \mathbb R$, we remove $0$ from the domain? So we're looking at a function $f \in C^\infty((-\infty, 0) \cup (0, \infty))$, and we want it to be in the kernel of the derivative operator on this space.

The differential equation $Df=0$ isn't hard to solve. You can see that $f$ needs to be locally constant just by integrating. But you can't integrate past $0$ since the function isn't defined there, so the general form of any such $f$ is

\begin{equation*} f(x) = \left\{ \begin{array}{lr} c_1 & : x <0 \\ c_2 & : x > 0. \end{array} \right. \end{equation*}

But wait! Now this function depends on 2 parameters; that is, $\dim \ker D = 2$ here. You can convince yourself that if you take the domain of the functions in question to be the union of a collection of $n$ disjoint intervals on $\mathbb R$, then a function solving $Df = 0$ for the derivative on this space is constant on each interval, and hence these form an $n$-dimensional space; i.e. $\dim \ker D = n$.
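Here's a discretized caricature of this count in numpy (my own toy model: the derivative replaced by a forward-difference matrix on a grid); the kernel dimension of the difference matrix counts the connected pieces of the grid.

```python
import numpy as np

def forward_diff(n):
    # (n-1) x n matrix sending samples (f_0, ..., f_{n-1}) to their successive differences
    return np.eye(n - 1, n, k=1) - np.eye(n - 1, n)

def kernel_dim(A):
    return A.shape[1] - np.linalg.matrix_rank(A)

one_interval = forward_diff(10)
two_intervals = np.block([
    [forward_diff(10), np.zeros((9, 10))],
    [np.zeros((9, 10)), forward_diff(10)],
])

print(kernel_dim(one_interval))    # 1: constants on a single interval
print(kernel_dim(two_intervals))   # 2: one constant per component
```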

So, while we have to forget about the constant functions, we're getting information back in the form of the kernel of $D$. The kernel's dimension is exactly the number of connected components of the domain of the functions. This continues to be true even if we allow the domain to be much more complicated than just a disjoint union of lines. For any smooth manifold (i.e. a space which looks locally like Euclidean space but may be connected in ways which a line or a plane isn't, e.g. a circle or a sphere) $M$, if $D$ is the gradient on $C^\infty(M)$, then $\dim \ker D$ is the number of connected components of $M$. You may not yet know multivariable calculus, so this may go over your head a bit, but it's nonetheless true and important.

This may seem like a random coincidence, but it's a rather deep fact with profound generalizations for topology. Unfortunately, there's a limit to how much we can do in one dimension, but I'll push that limit as far as I can.

Let's look at smooth functions on a circle $S^1$. Now, of course, we already know how $D$ fails to be injective (one-to-one) on $C^\infty(S^1)$, and that the kernel is 1-dimensional, since a circle is connected. What I want to look at is the failure of surjectivity; that is, will $D$ be onto? Of course, what exactly we mean by differentiating on $S^1$ isn't obvious. You can think of $Df$ at any given point as a gradient vector which points in the direction in which $f$ is growing, with size proportional to the rate of growth. 1-dimensional vectors are (for our purposes, which are purely topological) just numbers, so we get a number at every point. This turns out to not be exactly the right way to think about this for the sake of generalizing it, but it's good enough for now.

As it turns out, when you go all the way around a circle, you need to end up exactly back at the same value of the function you started at. Let's parametrize points on the circle by angle, running from $0$ to $2\pi$. This means that, if we want $f = D g$ for some function $g \in C^\infty(S^1)$, we need $$\int_0^{2 \pi} f(\theta) d \theta = \int_0^{2 \pi} \frac{dg}{d\theta} d\theta = g(2 \pi) - g(0) = 0.$$ So the average value of $f$ needs to be $0$.

This means that $\operatorname{im} D$ isn't all of $C^\infty (S^1)$; it's just those functions with average value $0$. That wasn't the case when we had $\mathbb R$. Now, if you take a function $f$, you can decompose $f(x) = f_0 + f_d(x)$, where $f_0$ is the average of $f$ and $f_d(x)$ is the deviation from the average at the point $x$. $f_d$ is a function that has $0$ average, and so it's in the image of $D$. So if we look at the quotient space $C^\infty (S^1) / \operatorname{im} D$, it's essentially just the space of constant functions on $S^1$, which is 1-dimensional. That is, $\dim C^\infty (S^1) / \operatorname{im} D = 1$.
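The same caricature on a circle (again just a numpy toy model, with the derivative replaced by a periodic difference matrix) shows both phenomena at once: the kernel is still one-dimensional, but the image consists only of vectors summing to zero, so the quotient by the image is also one-dimensional.

```python
import numpy as np

n = 12
# (Df)_i = f_{i+1} - f_i with indices taken mod n, i.e. a "derivative" on a discrete circle
D_circle = np.roll(np.eye(n), 1, axis=1) - np.eye(n)

rank = np.linalg.matrix_rank(D_circle)
print("dim ker   =", n - rank)   # 1
print("dim coker =", n - rank)   # 1 (square matrix, so kernel and cokernel dimensions agree)

# every column sums to 0, so everything in the image has zero average
print(np.allclose(D_circle.sum(axis=0), 0))   # True
```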

At this point, we've actually come upon something pretty surprising. An arbitrary 1-manifold $M$ is necessarily a disjoint union of $n$ intervals and $m$ circles, for some values of $n$ and $m$ (which could be infinite, but I'll ignore this). And we can classify these just by understanding what $D$ does on $M$; specifically, $\dim \ker D = n + m$ is the number of components, and $\dim C^\infty (M) / \operatorname{im} D = m$ is the number of circles.

It turns out that one can generalize this to measure not only circular holes in a manifold, but spherical holes of other dimensions. The generalization is neither obvious nor easy, but it does take you to something rather close to modern mathematical research: specifically, you'd end up at the de Rham cohomology of a manifold. This is a way to compute topological invariants of your manifold $M$ which depend just on the (non)existence of solutions of certain simple classes of differential equations; in some unrigorous sense, the de Rham cohomology (or more precisely the Betti numbers, which are the dimensions of the cohomology groups) counts the number of $n$-dimensional holes of $M$ for each nonnegative integer $n$. This answer is already too long and the amount of material needed to construct this is far too much to include here, but you may take a look at these blog posts to get some more information for starters: More than Infinitesimal: What is “dx”? and Homology: counting holes in doughnuts and why balls and disks are radically different.

Anyway, modern-day research in algebraic topology has progressed far enough that we no longer particularly need to construct cohomology this way; there are a plethora of other options and a great number of generalizations. But this is still among the most tractable and intuitive ones. There are many further applications of these things, the most incredible of which (for me at least) is probably the Atiyah-Singer index theorem, but that's unfortunately far too advanced to describe here.


So, for me anyway, the question at the end of the day isn't "Why do we define the derivative so that it isn't invertible", so much as "How can we use the fact that the derivative is not invertible to our advantage to do more mathematics?". I've really only given a couple of examples, but hopefully they're enough to convince you that this non-invertibility is actually not entirely a bad thing, and that it can be used to our advantage if we're intelligent about it.