Rigorous derivation/explanation of delta function representation?
I'll assume you know a little bit of Fourier analysis and functional analysis, and basic algebra. The space $L^1(G)$ of Lebesgue integrable functions is a Banach space for any locally compact abelian group -- if that doesn't make any sense, just think $G = \mathbb{R}$ and Banach space $=$ space of functions. The important thing is that if I equip this space with a multiplication $*$, given by convolution, $$(f*g)(x) = \int_G f(y)g(x-y)\, dy,$$ it becomes a Banach $\textit{algebra}$: multiplication is associative and satisfies $\|f*g\|_1 \leq \|f\|_1 \|g\|_1$ (and here it's even commutative! $f*g = g*f$).
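You can play with this convolution algebra numerically. Here is a minimal sketch (assuming numpy; the grid, test functions, and tolerance are illustrative choices of mine) checking commutativity and the identity $\int (f*g) = \int f \int g$ on a discretized line:

```python
import numpy as np

# Discretize two integrable functions on a grid and check f*g = g*f.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

f = np.exp(-x**2)                       # Gaussian
g = np.where(np.abs(x) < 1, 1.0, 0.0)   # box function

# Riemann-sum approximation of (f*g)(x) = ∫ f(y) g(x-y) dy
fg = np.convolve(f, g, mode="same") * dx
gf = np.convolve(g, f, mode="same") * dx

assert np.allclose(fg, gf)  # commutativity
```

Of course this only checks a finite-grid analogue, but it makes the algebraic structure concrete.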
Cool! But wait, there's a problem -- there's no unit. There is no $L^1$ function $h$ such that $(f*h)(x) = f(x)$ for all $x \in G$.
Here's where our friend $\delta$ comes in. The Dirac delta "function" is the unit in this algebra: $$(\delta*f)(x) = \int_{-\infty}^\infty \delta(y)f(x-y)\, dy = f(x-y)\bigg|_{y=0} = f(x).$$ (One can make this fairly rigorous by taking a limit of functions in $L^1$ -- $C^\infty$ in fact! -- which approximate the delta function and are called "kernels.")
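You can see this approximate-unit behavior numerically. A sketch (assuming numpy; the Gaussian kernels, grid, and test function are my own illustrative choices): as the kernel concentrates at $0$, $(\eta_\epsilon * f)(x)$ approaches $f(x)$.

```python
import numpy as np

# Gaussian kernels eta_eps approximating delta: as eps -> 0,
# (eta_eps * f)(x) -> f(x).
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
f = np.cos(x) * np.exp(-x**2 / 8)   # a smooth integrable test function

def eta(eps):
    # normalized Gaussian of width eps, integrating to 1
    return np.exp(-x**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

errs = []
for eps in [1.0, 0.1, 0.01]:
    conv = np.convolve(eta(eps), f, mode="same") * dx
    errs.append(np.max(np.abs(conv - f)))

# the sup-norm error shrinks as the kernel concentrates at 0
assert errs[0] > errs[1] > errs[2]
```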
Great, so how does this explain your mysterious formula for $\delta$? Well, what would you guess is the Fourier transform of $\delta$? [Stop. Think!] In $L^1(G)$ multiplication is given by that crazy guy Convolution, but for functions on the dual group (the "Pontryagin" dual, denoted $\widehat{G}$) multiplication is good ol' multiplication. This dual space is where $\xi$ lives, and $\hat{f}$ is a function on it. This space is also what physicists call 'frequency' space (as opposed to 'position' space) for obvious and good reasons. So how does that help us? Well, we'll borrow one more thing from the Devil of Algebra (though the Angel of Topology is actually doing the heavy lifting in the background -- but that story's for another time): homomorphisms.
It is perhaps no surprise that the Fourier transform $\mathcal{F} : f \rightarrow \hat{f}$ is a homomorphism. This is the typically-seen, less-often-understood statement that $$\widehat{f*g}(\xi) = \widehat{f}(\xi)\widehat{g}(\xi).$$
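The discrete analogue of this homomorphism property is easy to verify numerically: for the DFT, the transform of a *circular* convolution is the pointwise product of the transforms. A sketch (assuming numpy; the signal length and random data are illustrative):

```python
import numpy as np

# Check FFT(f*g) = FFT(f)·FFT(g) for circular convolution,
# the finite analogue of the convolution theorem.
rng = np.random.default_rng(0)
n = 64
f = rng.standard_normal(n)
g = rng.standard_normal(n)

# circular convolution, computed directly from the definition
conv = np.array([sum(f[m] * g[(k - m) % n] for m in range(n))
                 for k in range(n)])

assert np.allclose(np.fft.fft(conv), np.fft.fft(f) * np.fft.fft(g))
```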
Why do we care? Well, whaddya know about homomorphisms? They take the identity to the identity. We know the identity on $L^1(G)$ -- it's our beloved/hated $\delta$ wannabe function -- and we know that if $\hat{f}(\xi)\hat{g}(\xi) = \hat{f}(\xi)$ for all $\xi$ then $\hat{g}(\xi) \equiv 1$. In other words, $\mathcal{F}(\delta) = 1$. A quick "pseudo-computation" (this is, like, hella wrong) shows this: $$\hat{\delta}(\xi) = \int_{-\infty}^\infty \delta(x)e^{-i\xi x}\, dx = e^{-i\xi x}\bigg|_{x = 0} = 1.$$ Ugh, I can't believe I just wrote that.
At any rate, you can see how this leads to your question. You'd technically want to transform the constant function $1$ back, via the inverse Fourier transform: $$\mathcal{F}^{-1}(\hat{f})(x) = \frac{1}{2\pi}\int_{-\infty}^\infty \hat{f}(\xi)e^{i\xi x}\, d\xi$$ which in the case of the unit $\hat{f} \equiv 1$ is your formula. Here, $\xi = p$, and just replace $x$ by $x-a$. Just don't ever use that thing before taking a course on this stuff.
Finally, the normalization constant $\frac{1}{2\pi}$ depends entirely on your choice of definition of the Fourier transform. Some people put $\frac{1}{2\pi}$ on one direction and $1$ on the other, others $\frac{1}{\sqrt{2\pi}}$ on both, but I always just set $2\pi = 1$.
Hope that helped, ask if you like (though nicely).
Edit to answer question in comment, i.e. why am I evaluating at $x = 0.$
It's for the same reason I evaluated the convolution at $y = 0$. Remember, $\int \delta = 1$ and $\delta(0) = \infty$. Suppose you have a sequence of positive $C^\infty$ functions $\eta_n$ such that 1) $\int_{\mathbb{R}} \eta_n(x)\, dx = 1$ for all $n$, and 2) $\lim_{n\rightarrow\infty} \int_{|x| > \epsilon} \eta_n(x)\, dx = 0$ for all $\epsilon > 0$. Then $\{\eta_n\}$ is a family of good kernels. They approximate the delta function [it is clear from 2) that $\eta_n$ must blow up near $0$: each $\eta_n$ integrates to $1$ on all of $\mathbb{R}$, yet in the limit carries no mass outside any $\epsilon$-ball centered at $0$].
Let $\epsilon > 0$ be arbitrary and $f \in L^1$. In $L^1$, the map $y \mapsto f(\cdot - y)$ is uniformly continuous, by translation invariance of Lebesgue measure (hint: approximate by a continuous function of compact support). Let $\delta > 0$ be such that $\sup_{|y| \leq \delta} \|f - f(\cdot - y)\|_1 < \epsilon/2.$ Set $M = \|f\|_1$, and choose $N$ so large that $\int_{\mathbb{R} \setminus B(0,\delta)} \eta_n < \epsilon/4M$ for all $n \geq N$. Then for $n \geq N$, using Minkowski's integral inequality, \begin{align} \|\eta_n * f - f\|_1 &= \int_{\mathbb{R}} \left| \int_{\mathbb{R}} \eta_n(y)f(x-y)\, dy - f(x)\right| dx \\ &= \int_{\mathbb{R}} \left| \int_{\mathbb{R}} \eta_n(y)\big(f(x-y) - f(x)\big)\, dy \right| dx \tag{since $\int_{\mathbb{R}} \eta_n = 1$} \\ &\leq \int_{\mathbb{R}} \eta_n(y)\, \|f(\cdot - y) - f\|_1\, dy \\ &= \int_{B(0,\delta)} \eta_n(y)\, \|f(\cdot - y) - f\|_1\, dy + \int_{\mathbb{R} \setminus B(0,\delta)} \eta_n(y)\, \|f(\cdot - y) - f\|_1\, dy \\ &\leq \frac{\epsilon}{2}\|\eta_n\|_1 + \frac{\epsilon}{4M}\, 2\|f\|_1 = \epsilon. \end{align}
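This $L^1$ estimate can be checked numerically. A sketch (assuming numpy) using box kernels $\eta_n = \frac{n}{2}\mathbb{1}_{|x|<1/n}$ -- good kernels, though not $C^\infty$ -- against a discontinuous $f$, where the point is precisely that the convergence is in the $L^1$ norm:

```python
import numpy as np

# Numerical check that ||eta_n * f - f||_1 -> 0 for box kernels
# eta_n = (n/2)·1_{|x|<1/n} and a discontinuous f in L^1.
x = np.linspace(-5, 5, 4001)
dx = x[1] - x[0]
f = np.where(np.abs(x) < 1, 1.0, 0.0)   # box function

def l1_error(n):
    eta = np.where(np.abs(x) < 1.0 / n, n / 2.0, 0.0)
    eta /= eta.sum() * dx                # renormalize on the grid so ∫eta = 1
    conv = np.convolve(eta, f, mode="same") * dx
    return np.sum(np.abs(conv - f)) * dx

errors = [l1_error(n) for n in [2, 8, 32]]
assert errors[0] > errors[1] > errors[2]   # L^1 error decreases with n
```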
In some sense, the delta function "picks out" the value of $f(x)$ at $0$ when you convolve against it (remember, convolution is a weighted average).
I admit that's not very clear. It certainly didn't give me intuition. I much prefer to see it from the discrete case. Say we have a discrete group $G$ (countable number of elements). Define convolution as
$$(f*g)(x) = \sum_{y \in G} f(y)g(x-y)$$ (notice the resemblance to multiplying polynomials together...)
In this case, you DO have a unit, namely: $$e(x) = \begin{cases} 1 &\text{ if } x = 0 \\ 0 &\text{ otherwise } \end{cases}.$$
We have (taking $G = \mathbb{Z}$ for concreteness): \begin{gather} (f*e)(x) = \sum_{y \in G} f(y)e(x-y) = \dots + 0 + f(x-1)e(1) + f(x)e(0) + f(x+1)e(-1) + 0 + \dots \end{gather} I find this makes it clearer that all terms die except the one with $e(0)$. (BTW, here $0$ is the identity element of $G$.)
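The discrete picture is easy to code up. A sketch (names are illustrative) with finitely supported sequences on $G = \mathbb{Z}$ stored as dicts, which also shows the polynomial-multiplication resemblance:

```python
# Discrete convolution on G = Z with finitely supported sequences,
# represented as dicts {point: value}.
def conv(f, g):
    out = {}
    for y, fy in f.items():
        for z, gz in g.items():
            out[y + z] = out.get(y + z, 0) + fy * gz
    return out

e = {0: 1}                      # the unit: e(0) = 1, e(x) = 0 otherwise
f = {-1: 2, 0: 3, 2: 5}         # an arbitrary finitely supported f

assert conv(f, e) == f          # e really is a unit
assert conv(e, f) == f

# the "multiplying polynomials" resemblance: (1 + x)^2 = 1 + 2x + x^2
assert conv({0: 1, 1: 1}, {0: 1, 1: 1}) == {0: 1, 1: 2, 2: 1}
```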
All this stuff is really beautiful, and I encourage you to explore more about Fourier analysis, abstract harmonic analysis, representation theory, spectral theory ... (For instance, one reason you integrate against exponentials in the Fourier transform is because they are eigenfunctions of the translation action $x \rightarrow y + x$ -- which is, by the way, a left action of the group $G$ on itself). Here's a brief bit on this stuff (the Gelfand transform) that I wrote up a while ago -- http://wtfmathematics.com/wp-content/uploads/2013/05/fourier.pdf and perhaps better (with pictures!) http://wtfmathematics.com/wp-content/uploads/2013/05/snarski_gelfand.pdf
The 'simple justification' in Physics lectures is as follows:
\begin{align} {\rm f}\left(x\right) &= \int_{-\infty}^{\infty}\tilde{\rm f}\left(k\right){\rm e}^{{\rm i}kx}\, {{\rm d}k \over 2\pi} \\[3mm] \tilde{\rm f}\left(k\right) &= \int_{-\infty}^{\infty}{\rm f}\left(x\right){\rm e}^{-{\rm i}kx}\,{\rm d}x \end{align} $$ {\rm f}\left(x\right) = \int_{-\infty}^{\infty} \left[% \int_{-\infty}^{\infty} {\rm f}\left(x'\right){\rm e}^{-{\rm i}kx'}\,{\rm d}x'\, \right] {\rm e}^{{\rm i}kx}\,{{\rm d}k \over 2\pi} = \int_{-\infty}^{\infty}{\rm f}\left(x'\right)\left[% \int_{-\infty}^{\infty} {\rm e}^{{\rm i}k\left(x - x'\right)}\, {{\rm d}k \over 2\pi} \right] {\rm d}x' $$
"Compare" $\ $ with $$ {\rm f}\left(x\right) = \int_{-\infty}^{\infty}{\rm f}\left(x'\right)\delta\left(x - x'\right)\,{\rm d}x' $$
It's useful as a mnemonic rule. For the rigorous stuff, see @snarski .
It might be too late, but I thought I'd share my (non-rigorous) justification for it:
\begin{eqnarray} \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}e^{ikx}\,dk\,f(x)\,dx &=& \lim_{K\rightarrow +\infty}\int_{-\infty}^{+\infty}\left[\frac{e^{ikx}}{ix}\right]_{k=-K}^{k=K} f(x)\,dx \\ &=& \lim_{K\rightarrow +\infty}\int_{-\infty}^{+\infty}\frac{2i\sin(Kx)}{ix}f(x)\,dx \\ &=& \lim_{K\rightarrow +\infty}\int_{-\infty}^{+\infty}\frac{2\sin(y)}{y}f(y/K)\,dy, \qquad y=Kx \\ &=& \int_{-\infty}^{+\infty}\frac{2\sin(y)}{y}\lim_{K\rightarrow +\infty}f(y/K)\,dy \\ &=& \int_{-\infty}^{+\infty}\frac{2\sin(y)}{y}f(0)\,dy \\ &=& 2f(0)\int_{-\infty}^{+\infty}\frac{\sin(y)}{y}\,dy \\ &=& 2\pi f(0), \end{eqnarray}
where I used $e^{iKx} = \cos(Kx)+i\sin(Kx)$ and $\int_{-\infty}^{+\infty}\frac{\sin(y)}{y}\,dy = \pi$. (Interchanging the limit and the integral in the fourth line is the non-rigorous step -- $\frac{\sin y}{y}$ is not absolutely integrable, so dominated convergence does not apply directly.)
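The limit above can be sanity-checked numerically. A sketch (assuming numpy; the Gaussian test function, grid, and $K$ values are my own illustrative choices) of $\int \frac{2\sin(Kx)}{x} f(x)\, dx \rightarrow 2\pi f(0)$:

```python
import numpy as np

# Check that ∫ 2·sin(Kx)/x · f(x) dx -> 2π f(0) as K grows,
# with f a Gaussian so the convergence is fast.
x = np.linspace(-10, 10, 40001)
dx = x[1] - x[0]
f = np.exp(-x**2)               # test function with f(0) = 1

def I(K):
    # 2·sin(Kx)/x, written via np.sinc (= sin(pi t)/(pi t)) so x = 0 is finite
    kernel = 2 * K * np.sinc(K * x / np.pi)
    return np.sum(kernel * f) * dx

assert abs(I(20.0) - 2 * np.pi) < 1e-4
```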