Where is the wild use of the Dirac delta function in physics justified?
Solution 1:
To answer "Is there a math book written by a mathematician (not a physicist) which treats much of the above rigorously?":
The following references are my favorites:
a) "Mathematics for the Physical Sciences", Laurent Schwartz;
b) "Generalized Functions vol 1", I.M. Gelfand, G. E. Shilov.
These are classics and primary sources: a) in the area of distributions and b) in generalized functions. Generalized functions and distributions are the same thing (see the Wikipedia article on generalized functions). Both are rigorous math books, and both are very readable. Each has much information on the Dirac delta distribution (aka the 'delta function').
Schwartz is credited with originating the 'theory of distributions' which is also the title of his original book (in French only). "Mathematics for the Physical Sciences" contains much of the material in that book.
Gelfand, a master mathematician, goes into even more detail. A substantial portion of vol 1 is devoted to the Dirac Distribution.
To answer "Alternatively, if you can justify that the above properties are just physics (not math) ...":
The properties are math not physics.
Solution 2:
Associated to a function $f$ on a domain $D$ there is a linear operator given by $$g \mapsto \int_D f(x) \, g(x) \, dx$$ If we have a point $0 \in D$ then there is also a linear operator given by $$g \mapsto g(0)$$ and in many ways this behaves very much like a linear operator of the previous kind. For one thing, if you take a sequence of compact domains $C_i \to \{0\}$ and consider the "average value of $g$ on $C_i$" linear operator $$g \mapsto \int_D \frac{1_{C_i}(x)}{{\rm vol}(C_i)} g(x) \, dx$$ associated to the normalized indicator function $$f(x) = \frac{1_{C_i}(x)}{{\rm vol}(C_i)}$$ then this should obviously converge to the operator $g \mapsto g(0)$, at least assuming that things are set up so that convergence works properly. So we can imagine the linear operator $g \mapsto g(0)$ being associated to a "generalized function" $\delta(x)$, so that
$$``\int_D \delta(x) g(x) \, dx\text{''} := g(0)$$
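As a quick sanity check of the convergence claimed above, here is a small numerical sketch (my own illustration, not part of the original answer; the helper name is arbitrary): averaging a test function $g$ over shrinking intervals $C_i = [-h, h]$ approaches $g(0)$.

```python
import numpy as np

def average_over_interval(g, h, n=100001):
    """Approximate (1/vol(C)) * integral of g over C = [-h, h] by the trapezoid rule."""
    x = np.linspace(-h, h, n)
    y = g(x)
    dx = x[1] - x[0]
    return dx * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2) / (2 * h)

g = lambda x: np.cos(x) * np.exp(-x**2)  # an arbitrary smooth test function

for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, average_over_interval(g, h))  # approaches g(0) = 1.0
```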
You then just proceed to define "generalized functions" (or "distributions") to be objects having the desired properties, while in the background you're really just replacing the notion of a function $f$ with the associated linear operator [1] $$g \mapsto \int_D f(x) \, g(x) \, dx$$
That's really everything you need to know. Everything else just comes down to picking exactly what context you want to work in and choosing the things that make sense there -- if you want to use a larger space of test functions, you just have to restrict the class of functions $f$ you allow yourself to consider. But this just has to do with the functions (or "functions") that $\delta$ is going to sit alongside; $\delta$ itself works under pretty much any circumstances, since it doesn't require any notion of convergence to define.
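To make the "replace a function by its linear operator" idea concrete, here is a minimal sketch in Python (my own framing; `as_functional` is a hypothetical helper name). An ordinary function is lifted to the functional it induces, while $\delta$ is defined directly as evaluation, with no integral in sight.

```python
import numpy as np

def as_functional(f, L=20.0, n=200001):
    """Lift an ordinary function f to the functional g -> integral of f*g dx."""
    def F(g):
        x = np.linspace(-L, L, n)
        y = f(x) * g(x)
        dx = x[1] - x[0]
        return dx * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)  # trapezoid rule
    return F

delta = lambda g: g(0.0)  # delta needs no integral: it is just evaluation at 0

F = as_functional(lambda x: np.exp(-x**2) / np.sqrt(np.pi))
g = lambda x: np.cos(x)
print(F(g), delta(g))  # both are perfectly good linear functionals of g
```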
UPDATE: Knowing the above, the proofs of most of the statements listed in the question are routine calculations. You can find the definitions of all these things in any functional analysis text and simply plug in the Dirac delta. For instance, by definition the Fourier transform of a function is
$$\hat{f}(s) = \int_{-\infty}^\infty f(x) e^{-2 \pi i x s} \, dx$$
If we regard a function $f$ as corresponding to the linear operator $F$ where
$$F(g) := \int_{-\infty}^\infty f(x) g(x) \, dx$$
then this leads us to define
$$\hat{f}(s) := F(e^{-2 \pi i x s})$$
where "f" can be anything we associate a linear operator $F$ to. Remembering that $\delta$ is just a formal symbol corresponding to the linear operator $L(g) := g(0)$, we have
$$\hat{\delta}(s) = L(e^{-2 \pi i x s}) = e^{-2 \pi i \cdot 0 \cdot s} = 1$$
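Numerically, one can watch $\hat{\delta} \equiv 1$ emerge by transforming a narrow Gaussian stand-in for $\delta$ (the Gaussian choice is an assumption of this sketch; any approximate identity would do):

```python
import numpy as np

def fourier_transform(f, s, L=50.0, n=400001):
    """Approximate f_hat(s) = integral of f(x) exp(-2 pi i x s) dx over [-L, L]."""
    x = np.linspace(-L, L, n)
    y = f(x) * np.exp(-2j * np.pi * x * s)
    dx = x[1] - x[0]
    return dx * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)  # trapezoid rule

for eps in [1.0, 0.1, 0.01]:
    # normalized Gaussian of width eps, a stand-in for delta
    f = lambda x, e=eps: np.exp(-x**2 / (2 * e**2)) / (np.sqrt(2 * np.pi) * e)
    print(eps, [abs(fourier_transform(f, s)) for s in (0.0, 1.0, 3.0)])
    # each row tends to [1.0, 1.0, 1.0] as eps shrinks
```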
Similarly, if $f$ is a differentiable function then we can consider the linear operator associated to $f'$,
$$g \mapsto \int_{-\infty}^{\infty} f'(x) \, g(x) \, dx = - \int_{-\infty}^{\infty} f(x) g'(x) \, dx$$
where the equality follows from integrating by parts, using the fact that we're necessarily working in some context where $\lim_{x \to \pm \infty} f(x) g(x) = 0$. So the linear operator associated to $f'$ is
$$g \mapsto - \int_{-\infty}^{\infty} f(x) g'(x) \, dx = - F(g')$$
so we choose to take this as the definition of the derivative of anything to which we can associate a linear operator. In the case of the Dirac delta function, $\delta'$ denotes the formal symbol corresponding to the linear operator $g \mapsto -g'(0)$.
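The sign matters, and a quick numerical sketch (again using a Gaussian stand-in, my own illustration) confirms that pairing the derivative of a delta approximation with $g$ tends to $-g'(0)$:

```python
import numpy as np

def pair(f, g, L=10.0, n=400001):
    """Approximate the pairing integral of f(x)*g(x) dx over [-L, L]."""
    x = np.linspace(-L, L, n)
    y = f(x) * g(x)
    dx = x[1] - x[0]
    return dx * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)  # trapezoid rule

g = lambda x: np.sin(x) * np.exp(-x**2)  # test function with g'(0) = 1

for eps in [0.5, 0.1, 0.02]:
    # exact derivative of the normalized Gaussian bump of width eps
    d_delta = lambda x, e=eps: (-x / e**2) * np.exp(-x**2 / (2 * e**2)) / (np.sqrt(2 * np.pi) * e)
    print(eps, pair(d_delta, g))  # tends to -g'(0) = -1
```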
[1] If you prefer measure theory to functional analysis, you might instead think of replacing the function $f(x)$ with the measure $\mu$ given by $d\mu = f(x) \, dx$. Then the $\delta$ "function" is merely a formal notation such that $\delta(x) \, dx$ denotes a point mass measure centered at zero. It amounts to the same thing, since ultimately what you do with a measure is integrate something with respect to it.
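In code, the measure viewpoint just says that integrating $g$ against $f(x)\,dx + c\,\delta_0$ adds an ordinary integral and $c \cdot g(0)$; a minimal sketch (all names mine):

```python
import numpy as np

def integrate_against_measure(g, density, atoms, L=20.0, n=200001):
    """Integrate g against the measure density(x) dx + a sum of point masses.

    atoms is a list of (location, mass) pairs, e.g. [(0.0, 1.0)] for delta_0.
    """
    x = np.linspace(-L, L, n)
    y = density(x) * g(x)
    dx = x[1] - x[0]
    continuous_part = dx * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)  # trapezoid rule
    atomic_part = sum(mass * g(loc) for loc, mass in atoms)
    return continuous_part + atomic_part

g = lambda x: np.cos(x)
# density e^{-x^2}/sqrt(pi) plus a unit point mass at 0
print(integrate_against_measure(g, lambda x: np.exp(-x**2) / np.sqrt(np.pi), [(0.0, 1.0)]))
# approximately e^{-1/4} + 1
```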
Solution 3:
The term you're looking for is distribution theory. In the language of distributions, it is extremely simple to make the Dirac delta "function" rigorous, and to prove the aforementioned properties.
Here's the basic notion of a distribution:
A distribution is a continuous linear map from a set of nice functions (called "test functions") to $\mathbb{R}$.
Notice, by the way, that this means distributions are actually honest-to-goodness functions. However, they're functions that eat other functions, which makes them somewhat different from, say, functions on the real line. For one thing, it's probably not immediately clear how to define a "derivative" or anything else. Once we look at the details, we'll find a way around this pretty quickly.
When we pick different sets of test functions, we get different notions of "distribution." To begin with, let's choose our space of test functions $D$ to be the set of infinitely differentiable functions $\mathbb{R}^d \to \mathbb{R}$ that have compact support (that is, we require the functions to be zero except on some compact set). We need some topology on $D$ in order to make sense of the term "continuous." (If you're not familiar with topologies and convergence, skip the next line for now.) The topology on $D$ is usually given by specifying what convergence means on $D$: we will say that a sequence of elements $\varphi_k$ in $D$ converges to $\varphi$ as $k \to \infty$ if and only if every derivative of $\varphi_k$ converges uniformly to the corresponding derivative of $\varphi$ and all the $\varphi_k$ have supports contained in a common compact set.
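For concreteness, here is a sketch of the classic example of an element of $D$ (in dimension one): the bump function $e^{-1/(1-x^2)}$ on $(-1,1)$, extended by zero, which is smooth and compactly supported.

```python
import numpy as np

def bump(x):
    """The classic test function: smooth, supported on [-1, 1]."""
    out = np.zeros_like(x, dtype=float)
    inside = np.abs(x) < 1
    out[inside] = np.exp(-1.0 / (1.0 - x[inside] ** 2))
    return out

x = np.linspace(-2.0, 2.0, 9)
print(bump(x))  # zero outside (-1, 1), positive inside
```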
An example of a distribution is the map $D \to \mathbb{R}$ given by $$\varphi \mapsto \varphi(0)$$ You can check that this is a continuous linear map. This map is the Dirac delta "function" $\delta$.
Another example: Say we have a locally integrable function $f: \mathbb{R}^d \to \mathbb{R}$. Then we can define another distribution $$\varphi \mapsto \int f(x) \varphi(x) \, dx$$ Now this is linear in $\varphi$, and is continuous. I'll write $(f, \varphi)$ for this distribution.
Perhaps somewhat confusingly, when $F$ is a distribution (not a locally integrable function like $f$) we often reuse the same pairing notation $(F, \varphi)$ for the distribution $F$ applied to $\varphi$.
Keep in mind that the first example above cannot be written in the form of the second example, i.e. as integration against a locally integrable function; nonetheless, the integral notation is often used for it, particularly in physics: $\int \delta(x) f(x) \, dx = f(0)$. This is a pretty common sleight of hand: we pretend that distributions are given by integrating against a nice function even though not all distributions can be written this way.
Now we want to define a notion of "derivative" for distributions. Since a distribution is a function from a space of functions to the real numbers, it's not immediately clear how to do this. Let's try that aforementioned sleight-of-hand: consider the distributions of the form $\varphi \mapsto \int f(x) \varphi(x) \, dx$ for some locally-integrable $f$.
From the usual integration-by-parts formula of ordinary calculus, applied once for each of the $|\alpha|$ derivatives in the multi-index $\alpha$,
$$\int \partial_x^{\alpha} f(x) \, \varphi(x) \, dx = (-1)^{|\alpha|} \int f(x) \, \partial_x^{\alpha} \varphi(x) \, dx$$
(Note that the usual boundary terms in the integration-by-parts formula go away because $\varphi$ has compact support.)
To put this back into the notation from above: $(\partial_x^{\alpha} f(x), \varphi) = (-1)^{|\alpha|} (f(x), \partial_x^{\alpha} \varphi)$. So this suggests a way to define "differentiation": let's use this formula as a definition.
That is, for any distribution $F\colon D \to \mathbb{R}$, we define the distributional derivative $\partial_x^{\alpha} F$ by $$(\partial_x^{\alpha} F, \varphi) := (-1)^{|\alpha|} (F, \partial_x^{\alpha} \varphi)$$
For example, let's consider the distribution given by (integrating against) the Heaviside function (we're taking $\mathbb{R}^d$ in the definition of $D$ to be $\mathbb{R}^1$):
$$ H(x) = \begin{cases} 1 &(x >0)\\ 0 &(x\leq 0) \end{cases} $$
As in the second example, the distribution defined by $H$ is $(H, \varphi) = \int H(x) \varphi(x) \, dx$. As an exercise, compute the derivative of this distribution from the definition (the answer is at the bottom).
So to recap: A distribution is a continuous linear map from a set of nice functions (called "test functions") to $\mathbb{R}$. A dirty trick we will use again and again in distribution theory is to systematically confuse a function $f$ and the distribution given by integrating against it. Using this trick, we can use relatively basic mathematics to understand what certain notions like integration ought to mean for distributions, and then take this to be the definition. I've shown how to do this with (partial) derivatives; you can do the same with convolutions, adjoints, and more.
The above is just meant to give you a small flavor of the subject, so I won't go any further, and most good analysis texts should have more details if you seek them. A readable source (though not one I personally favor) is Stein and Shakarchi's Functional Analysis, Chapter 3.
Answer: The distributional derivative of this is:
$$(H', \varphi) = - (H, \varphi') = -\int H(x) \varphi'(x) \, dx = - \int_{0}^\infty \varphi'(x) \, dx = \varphi(0) - \lim_{R \to \infty} \varphi(R) = \varphi(0)$$ where the limit vanishes because $\varphi$ has compact support.
Notice that the Dirac delta "function" (distribution) applied to $\varphi$ gives precisely the same thing! (Hence the common confusing claim in intro physics classes: "the Dirac delta is just the derivative of the Heaviside function.")
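As a final numerical sanity check (my own sketch, not part of the answer): the pairing $-(H, \varphi')$ does return $\varphi(0)$. Here $\varphi(x) = e^{-x^2}$ is used for convenience; it is negligible rather than exactly zero outside a bounded set, which is good enough for a numerical check.

```python
import numpy as np

def heaviside_derivative_pairing(phi_prime, L=10.0, n=200001):
    """Approximate (H', phi) = -(H, phi') = -integral_0^L phi'(x) dx."""
    x = np.linspace(0.0, L, n)
    y = phi_prime(x)
    dx = x[1] - x[0]
    return -dx * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)  # trapezoid rule

phi_prime = lambda x: -2 * x * np.exp(-x**2)  # derivative of phi(x) = e^{-x^2}
print(heaviside_derivative_pairing(phi_prime))  # approximately phi(0) = 1
```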