Real Analysis question that affects how to think about the Dirac delta function.

First, a comment about the property you mention. It's true that if $f$ and $g$ are measurable functions and $f = g$ almost everywhere, then

$$ \int_Ef(x)dx = \int_Eg(x)dx $$ This doesn't carry over literally when $f$ and $g$ are distributions, because as you say distributions aren't functions. So the integral notation is just that: notation. Now...

A1. The answer to your first question comes from wanting the theory of distributions to functionally "look like" the theory of nicer linear functionals, i.e. we were already really good at manipulating integrals, so we wanted the new thing to work the same way operationally. For physicists, who were used to the Riesz representation theorem, this meant wanting to think of linear functionals as being just integrals against a fixed function. Hence the inner product/integral notation

$$ F(\varphi) = \langle F,\varphi\rangle = \int F(x)\varphi(x)dx $$ Note that in distribution theory, this is not a standard Lebesgue integral - it's just formal notation. The integral notation is ubiquitous though - it's not just linear functionals that are written this way, but also linear operators

$$ g(x) = \int k(x,y)f(y)dy $$ There is even a sort of generalization of the Riesz representation theorem called the Schwartz Kernel Theorem that says that any (nice enough) linear operator $g = Kf$ can be written using an "integral" like that, but where the kernel function $k(x,y)$ is possibly a distribution. The moral of the story is, you should extend your understanding of the integral notation to include other linear operations, not just integration of standard functions. Once you've proved that all the usual operations that you're used to, like integration by parts, make sense with distributions, you'll see that using the integral notation is very natural and 100% rigorous - as long as you remember that it's just notation for "apply the linear operation specified".
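
As a small worked illustration of the integration by parts remark (a standard textbook example rather than anything from your question; $H$ denotes the Heaviside step function and $\varphi$ a smooth, compactly supported test function, both introduced here only for illustration), the derivative of a distribution is defined so that integration by parts holds with no boundary term:

$$ \langle F',\varphi\rangle := -\langle F,\varphi'\rangle $$

For $F = H$, which is an honest function, the right-hand side is a genuine integral, and the boundary term at infinity vanishes because $\varphi$ has compact support:

$$ \langle H',\varphi\rangle = -\int_0^\infty \varphi'(x)dx = \varphi(0) = \langle \delta,\varphi\rangle $$

So $H' = \delta$ as distributions. In the same spirit, the identity operator can be written with the distributional kernel $k(x,y) = \delta(x-y)$:

$$ f(x) = \int \delta(x-y)f(y)dy $$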

A2. It isn't fatal at all to think of the delta function in this way - in fact this is a preferred method to define the delta function and many other distributions. The rect function represents a sort of local averaging, and you can think of the delta function as being an "infinitely local" averaging (i.e. sampling). The one thing I would recommend is looking into "approximations to the identity" - the rect function construction is just one possible construction, and in order to show that the delta function is uniquely defined, one should show that any similar sequence of approximate deltas also gives the same result (e.g. triangles, Gaussians, etc). In other words, you could either define the delta function as "that linear functional $F$ such that $F(\varphi) = \varphi(0)$", in which case you need to show that this is a well-defined, bounded linear operation on some function space, or alternatively you could define the delta function as "the limit of $\langle \delta_\epsilon,\varphi\rangle$ as $\epsilon\rightarrow 0$", in which case you still need to prove that this is a well-defined, linear operation on some function space. Either way, the result is the same.
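
If it helps to see the "any approximate delta gives the same answer" claim numerically, here is a rough sketch. It is a sanity check and not a proof. Everything in it is my own illustrative choice: the three families `rect`, `triangle` and `gaussian`, the particular function `phi` (smooth and rapidly decaying with $\varphi(0) = 1$, but not compactly supported, so only a stand-in for a true test function), and the midpoint-rule helper `pairing` that approximates $\langle \delta_\epsilon,\varphi\rangle$.

```python
import math


def rect(x, eps):
    """Box of width eps and height 1/eps, so the total area is 1."""
    return 1.0 / eps if abs(x) <= eps / 2.0 else 0.0


def triangle(x, eps):
    """Tent of half-width eps and peak height 1/eps, so the total area is 1."""
    return max(0.0, 1.0 - abs(x) / eps) / eps


def gaussian(x, eps):
    """Normal density with standard deviation eps, so the total area is 1."""
    return math.exp(-0.5 * (x / eps) ** 2) / (eps * math.sqrt(2.0 * math.pi))


def phi(x):
    """Smooth, rapidly decaying stand-in for a test function, with phi(0) = 1."""
    return math.cos(x) * math.exp(-x * x)


def pairing(delta_eps, test_fn, eps, half_width=5.0, n=100_000):
    """Midpoint-rule estimate of the integral of delta_eps(x, eps) * test_fn(x)
    over [-half_width, half_width]."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        total += delta_eps(x, eps) * test_fn(x)
    return total * h


if __name__ == "__main__":
    # Each family should approach phi(0) = 1 as eps shrinks.
    for family in (rect, triangle, gaussian):
        for eps in (0.5, 0.1, 0.02):
            value = pairing(family, phi, eps)
            print(f"{family.__name__:>8}  eps={eps:<5}  <delta_eps, phi> ~ {value:.5f}")
```

All three families should drift toward $\varphi(0) = 1$ as $\epsilon$ decreases, at slightly different rates because their second moments differ. The grid is fixed, so pushing $\epsilon$ far below the step size would make the quadrature itself unreliable.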