Intuition for the Importance of Modular Forms
Solution 1:
The theory of modular forms arose out of the study of elliptic integrals (as did the theory of elliptic curves, and much of modern algebraic geometry, and indeed much of modern mathematics). People understood that (complete) elliptic integrals (which we would think of as the number obtained by integrating a de Rham cohomology class, e.g. the one associated to the holomorphic differential on an elliptic curve, over a homology class on the curve) depended on an invariant (what we would think of as the $j$-invariant of the elliptic curve, although historically people used other invariants, often depending on some auxiliary level structure, such as $\lambda$, or $k$ (the square-root of $\lambda$)). This invariant was called the modulus (which is the origin of the adjective modular in this context).
People knew that if you replaced an elliptic curve by an $N$-isogenous one, then the elliptic integral would be multiplied by $N$ (in terms of $\mathbb C/\Lambda$, the elliptic integral is just one of the basis elements for $\Lambda$, and multiplying this by $N$, while keeping the other one fixed, gives a new elliptic curve related to the original one by an $N$-isogeny). They asked themselves how they could describe the modulus for this $N$-isogenous elliptic curve (or integral) in terms of the original one. This led them to find explicit equations for the modular curves $X_0(N)$ (for small values of $N$).
With these kinds of investigations (and remember, these were brilliant people --- Jacobi, Kronecker, Klein, just to mention some spanning a good part of the 19th century), it was natural that they were led to modular forms as well as modular functions (as one example, the Taylor coefficients of elliptic functions give modular forms; as another, the coordinates --- say with respect to Weierstrass elliptic functions --- of $N$-torsion points give level $N$ modular forms).
So all these investigations grew out of the study of elliptic integrals, but became intimately connected with the invention of algebraic topology, the development of complex analysis (by Riemann, and then Schwarz, and then the uniformization theorem), the development of hyperbolic geometry; basically all the fundamental mathematics of the 19th century that then drove much of the developments of 20th century mathematics.
The connections with arithmetic were also observed early on. Jacobi already introduced theta series and saw the relationship with counting representations by quadratic forms (e.g. he proved that the number of ways of writing $n \geq 1$ as a sum of four squares, counting order and signs, is equal to $8\sum_{d | n, 4 \not\mid d} d$, using weight $2$ modular forms on $\Gamma_0(4)$).
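Jacobi's four-square formula is easy to sanity-check by brute force. The sketch below assumes the usual normalization, in which representations are counted as ordered 4-tuples of integers with signs, so that $r_4(n) = 8\sum_{d \mid n,\, 4 \nmid d} d$ for $n \geq 1$; the function names are only illustrative.

```python
from itertools import product
from math import isqrt


def r4_bruteforce(n):
    """Count ordered integer 4-tuples (a, b, c, d) with a^2 + b^2 + c^2 + d^2 = n."""
    m = isqrt(n)
    count = 0
    for a, b, c in product(range(-m, m + 1), repeat=3):
        rest = n - a * a - b * b - c * c
        if rest < 0:
            continue
        d = isqrt(rest)
        if d * d == rest:
            # d = 0 gives one tuple, otherwise both d and -d work
            count += 1 if d == 0 else 2
    return count


def r4_jacobi(n):
    """Jacobi's closed form: 8 times the sum of the divisors of n not divisible by 4."""
    return 8 * sum(d for d in range(1, n + 1) if n % d == 0 and d % 4 != 0)


if __name__ == "__main__":
    for n in range(1, 21):
        print(n, r4_bruteforce(n), r4_jacobi(n))
```

The two columns printed should agree for every $n$; the brute-force count is only practical for small $n$, which is all that is needed for a check.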
But Kronecker (and maybe Abel, Eisenstein and even Gauss before him) also knew that modular forms, when evaluated at CM elliptic curves (i.e. at quadratic imaginary values of $\tau$) gave algebraic number values in some contexts. Gauss was led to this by the analogy with cyclotomy: $N$-torsion on an elliptic curve was analogous to $N$th roots of $1$ on the unit circle, and the analogy is tighter when the elliptic curve has CM, because then the $N$-torsion points become a cyclic module over the ring of CMs, just as the $N$th roots of $1$ are a cyclic module over $\mathbb Z$ (i.e. a cyclic group).
Kronecker (and again, maybe people before him) realized that CM elliptic curves corresponded to lattices $\Lambda \subset \mathbb C$ that belong to ideal classes in quadratic imaginary fields, and so saw a relationship between CM elliptic curves and class field theory for quadratic imaginary fields (Kronecker's Jugendtraum). This also related to the previous work on evaluating modular forms at CM points.
All this is just to say that even in the 19th century the subject was very deep, and already very connected to number theory, as well as everything else.
Ramanujan knew the theory very well, and discovered new phenomena (e.g. his conjectures on the behaviour of $\tau(n)$, defined by $\Delta = q\prod_{n=1}^{\infty} (1- q^n)^{24} = \sum_{n=1}^{\infty} \tau(n) q^n$). Mordell proved Ramanujan's conjecture on the multiplicative nature of $\tau$, and Hecke introduced his operators to systematize Mordell's method of proof.
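To see the multiplicativity concretely, one can expand the product for $\Delta$ directly. The following is a minimal sketch (the helper name `tau_values` is made up for this illustration) that computes the first few $\tau(n)$ with exact integer arithmetic and lets one check $\tau(mn) = \tau(m)\tau(n)$ for coprime $m, n$.

```python
from math import gcd


def tau_values(limit):
    """Return a list t with t[n] = tau(n) for 1 <= n <= limit (t[0] is unused and set to 0)."""
    # coeffs[k] will hold the coefficient of q^k in prod_{n>=1} (1 - q^n)^24, truncated at q^(limit-1)
    coeffs = [0] * limit
    coeffs[0] = 1
    for n in range(1, limit):
        for _ in range(24):
            # multiply the truncated series by (1 - q^n), in place, from high degree to low
            for k in range(limit - 1, n - 1, -1):
                coeffs[k] -= coeffs[k - n]
    # Delta = q * product, so tau(n) is the coefficient of q^(n-1) in the product
    return [0] + coeffs


if __name__ == "__main__":
    t = tau_values(30)
    print(t[1:8])
    for m in range(2, 30):
        for n in range(m + 1, 30):
            if gcd(m, n) == 1 and m * n <= 30:
                assert t[m * n] == t[m] * t[n]
```

The same table also illustrates the prime-power relation, e.g. $\tau(4) = \tau(2)^2 - 2^{11}$.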
At this point, the subject moved in a more representation-theoretic and analytic direction, with the generalization to automorphic forms. With the discovery in the 50s, 60s, and 70s of the modularity conjecture for elliptic curves over $\mathbb Q$, and related ideas, the arithmetic theory of modular forms became a central topic again. See this answer on MO for more on that.
Mazur's theorem on torsion points on elliptic curves over $\mathbb Q$ is one of the deepest results that comes from thinking of $X_0(N)$ and $X_1(N)$ directly in modular terms. But already the proofs are more automorphic in nature, and are focussed on the relationships between modular forms, particularly Hecke eigenforms, and Galois representations. That's where the modern focus primarily lies. You can see some of the other answers linked from my webpage (here) for more on that.
Let me close this long discussion by just saying that the passage to Galois representations as a focus is a natural development from Kronecker's Jugendtraum, but reflects a shifting of attention from abelian class field theory for quadratic imaginary fields to non-abelian (more precisely, $\mathrm{GL}_2$) class field theory for $\mathbb Q$. (Note that the former embeds in the latter, since the induction of a Galois character of a quadratic extension gives a two-dimensional rep. of $G_{\mathbb Q}$.)
Finally, let me mention that the main theme of Mazur's article is congruences between cuspforms and Eisenstein series (this is what the Eisenstein ideal measures), and so it's hard to have one without the other. (In some sense, Eisenstein series are like the trivial Dirichlet character mod $N$, while cuspforms are like the non-trivial characters. Which is more important depends on what you are doing; in many problems you need to consider both.)
Solution 2:
I don't have very specific answers to your questions (some of these might be better answered by someone with more background in complex geometry), but I think that I can address some aspects of the importance of modular forms for number theory.
To understand, I think that a little historical perspective is always good to have. This is not exactly a direct answer but please bear with me.
In his foundational paper on analytic number theory, Riemann expressed the "completed zeta function" $\Lambda(s) = \Gamma(s/2) \pi^{-s/2}\zeta(s)$ as a Mellin transform (in the variable $t$, evaluated at $s/2$) of $(\theta(it)-1)/2$, where $\theta(\tau) = \sum_{n \in \mathbf Z} q^{n^2}$ ($q=e^{\pi i \tau}$) is the Jacobi theta function. By the rapid convergence of this series for $\text{Im}\, \tau > 0$, the theta function is holomorphic in the upper half-plane, and it follows from the Poisson summation formula that $\theta(-1/\tau) = (-i\tau)^{1/2}\theta(\tau)$. By using this functional equation, Riemann proved that $\Lambda(s)$ extends to a meromorphic function on $\mathbf C$, with simple poles only at $s=0$ and $s=1$, and satisfies the functional equation $\Lambda(s)=\Lambda(1-s)$.
Now $\theta$ is obviously periodic of period $2$, so we see that it transforms nicely under the group generated by $\tau \mapsto \tau+2$ and $\tau \mapsto -1/\tau$. This is the so-called theta group, a subgroup of index $3$ in $\text{PSL}_2(\mathbf Z)$ (the full group is generated by $\tau \mapsto \tau+1$ and $\tau \mapsto -1/\tau$, but $\theta$ is not invariant under $\tau \mapsto \tau + 1$). Thus we see that $\theta$ is a modular form of weight $1/2$ for the theta group, with an extra "character" (more precisely, a multiplier system with values in the eighth roots of unity; for $\tau \mapsto -1/\tau$ it is the factor $(-i)^{1/2}$ appearing above).
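The transformation law can be checked numerically. This sketch assumes the normalization $\theta(\tau) = \sum_{n \in \mathbf Z} e^{\pi i n^2 \tau}$, for which the law reads $\theta(-1/\tau) = (-i\tau)^{1/2}\theta(\tau)$ with the principal branch of the square root and $\theta(\tau + 2) = \theta(\tau)$; the truncation at 30 terms and the sample point are arbitrary choices.

```python
import cmath


def theta(tau, terms=30):
    """Truncated theta series sum_{|n| <= terms} exp(pi * i * n^2 * tau), for Im(tau) > 0."""
    return sum(cmath.exp(cmath.pi * 1j * n * n * tau) for n in range(-terms, terms + 1))


def inversion_defect(tau):
    """|theta(-1/tau) - sqrt(-i*tau) * theta(tau)|, which should be close to zero."""
    lhs = theta(-1 / tau)
    # For Im(tau) > 0 the number -i*tau has positive real part, so the principal branch applies.
    rhs = cmath.sqrt(-1j * tau) * theta(tau)
    return abs(lhs - rhs)


def translation_defect(tau):
    """|theta(tau + 2) - theta(tau)|, which should be close to zero."""
    return abs(theta(tau + 2) - theta(tau))


if __name__ == "__main__":
    tau = 0.3 + 0.8j
    print(inversion_defect(tau), translation_defect(tau))
    # By contrast, theta(tau + 1) differs from theta(tau) in this normalization:
    print(abs(theta(tau + 1) - theta(tau)))
```

Both defects should come out at the level of floating-point rounding, while the last printed number is visibly nonzero.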
In the $19^{th}$ century, following the pioneering work of Euler and Fagnano on the transformation properties of elliptic integrals, mathematicians, led by Weierstrass and Jacobi, studied elliptic curves over $\mathbf C$. It was well understood then that these objects could be thought of either as smooth projective cubics over $\mathbf C$ or as complex tori of dimension $1$. During this period, the first modular forms of integral weight were discovered in the "invariants" of elliptic curves over $\mathbf C$.
Poincaré was the first to consider seriously elliptic curves over $\mathbf Q$. Poincaré conjectured that if $E/\mathbf Q$ is an elliptic curve, then $E(\mathbf Q)$ is a finitely generated abelian group. Some examples of this had already been supplied unknowingly by Fermat. It was proven some years later by Mordell and eventually for elliptic curves over any number field by Weil.
Weil, in 1949, formulated his famous conjectures about the zeta functions of smooth algebraic varieties over finite fields, and proved them in the case of curves. This led Hasse to define the zeta function of a smooth algebraic variety over $\mathbf Q$, and in particular of an elliptic curve. For an elliptic curve $E/\mathbf Q$, he defined an $L$-function in the following way: by reducing modulo $p$ an integral model of $E$ for each prime $p$ not dividing $\Delta(E)$, he obtained an elliptic curve over $\mathbf F_p$ for each such $p$; he defined $L_0(E, s)$ as the product of the local zeta functions evaluated at $p^{-s}$. He conjectured the convergence of this product for $\text{Re }s > 3/2$, and conjectured a functional equation for it. For elliptic curves with complex multiplication, he was able to prove this conjecture essentially by class field theory over an imaginary quadratic field. This proof, like Riemann's, once again involved modular forms of half-integral weight.
(Hasse was well aware, however, that his $L$-function was missing factors for the primes dividing $\Delta$. It is only with the work of Grothendieck and his school that the missing factors could be accounted for, by interpreting $L(E,s)$ as the Artin $L$-function of the $\ell$-adic cohomology of $E$.)
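The local data at a good prime $p$ is just a point count: writing $a_p = p + 1 - \#E(\mathbf F_p)$, the local factor is determined by $a_p$. As a small illustration, here is a brute-force count for the curve $y^2 = x^3 - x$. This particular curve is my own choice of example, not one taken from the discussion above; it has complex multiplication by $\mathbf Z[i]$ and good reduction at every odd prime.

```python
def legendre(a, p):
    """Legendre symbol (a/p) for an odd prime p, computed with Euler's criterion."""
    r = pow(a % p, (p - 1) // 2, p)  # r is 0, 1 or p - 1
    return -1 if r == p - 1 else r


def a_p(p):
    """a_p = p + 1 - #E(F_p) for E: y^2 = x^3 - x, where p is an odd prime."""
    # Each x contributes 1 + (f(x)/p) affine points, plus one point at infinity,
    # so #E(F_p) = p + 1 + sum of Legendre symbols.
    return -sum(legendre(x ** 3 - x, p) for x in range(p))


if __name__ == "__main__":
    for p in [3, 5, 7, 11, 13, 17, 19, 23, 29]:
        ap = a_p(p)
        # Hasse's bound |a_p| <= 2*sqrt(p)
        print(p, ap, ap * ap <= 4 * p)
```

For this curve $a_p = 0$ whenever $p \equiv 3 \pmod 4$, which is the kind of regularity that class field theory over $\mathbf Q(i)$ explains.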
In the fifties and sixties, Shimura, Taniyama and Weil conjectured that the $L$-function of any elliptic curve over $\mathbf Q$ should come from a modular form. More precisely, they conjectured that given an elliptic curve of conductor $N$, there exists a Hecke newform $f$ of weight $2$ and level $N$ such that $L(E, s) = L(f, s)$. If that were true, then the analytic continuation and functional equation of $L(E,s)$ would follow directly from that of $f$, in the same spirit as for Riemann's proof. This is the celebrated theorem of Wiles, Breuil, Conrad, Diamond and Taylor, of which Fermat's last theorem is a consequence.
Eichler and Shimura provided a construction going in the other direction - namely, given a Hecke newform $f$ of weight $2$ and level $N$ with rational Fourier coefficients, they constructed an elliptic curve $E$ over $\mathbf Q$ such that $L(E,s) = L(f, s)$ (they found this curve sitting inside the Jacobian variety of $X_0(N)$).
Since the Hasse-Weil $L$-function of an elliptic curve over $\mathbf Q$ only depends on its isogeny class, there is a correspondence
$$\{E/\mathbf Q \text{ an elliptic curve of conductor $N$}\}/{\text{isogeny}} \cong \{\text{normalized newforms in $S_2(\Gamma_0(N))$ with rational coefficients}\}.$$
Since $S_2(\Gamma_0(N)) \cong \Omega^1(X_0(N))$ by the map $f \mapsto f \: d\tau$, the number of isogeny classes of elliptic curves over $\mathbf Q$ of conductor $N$ is at most the genus of $X_0(N)$. For example, when $N=1$, $X_0(1) \cong \mathbf P^1$ has genus $0$, so there are no elliptic curves of conductor $1$ (already a not-so-trivial fact).
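The genus of $X_0(N)$, which equals $\dim S_2(\Gamma_0(N))$, can be computed from the standard formula $g = 1 + \mu/12 - \nu_2/4 - \nu_3/3 - \nu_\infty/2$, where $\mu$ is the index of $\Gamma_0(N)$ in $\text{PSL}_2(\mathbf Z)$, $\nu_2$ and $\nu_3$ count elliptic points, and $\nu_\infty$ counts cusps. Here is a short sketch of that computation (the helper names are only illustrative):

```python
from fractions import Fraction
from math import gcd


def prime_factors(n):
    """Distinct prime factors of n, by trial division."""
    primes, p = [], 2
    while p * p <= n:
        if n % p == 0:
            primes.append(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        primes.append(n)
    return primes


def euler_phi(n):
    result = n
    for p in prime_factors(n):
        result -= result // p
    return result


def genus_X0(N):
    """Genus of X_0(N) via g = 1 + mu/12 - nu2/4 - nu3/3 - cusps/2."""
    primes = prime_factors(N)
    mu = N
    for p in primes:
        mu = mu // p * (p + 1)  # index: N * prod (1 + 1/p)
    nu2 = 0 if N % 4 == 0 else 1
    nu3 = 0 if N % 9 == 0 else 1
    for p in primes:
        # factors 1 + (-1/p) and 1 + (-3/p), with the symbol equal to 0 at p = 2 resp. p = 3
        nu2 *= 1 + (0 if p == 2 else (1 if p % 4 == 1 else -1))
        nu3 *= 1 + (0 if p == 3 else (1 if p % 3 == 1 else -1))
    cusps = sum(euler_phi(gcd(d, N // d)) for d in range(1, N + 1) if N % d == 0)
    g = 1 + Fraction(mu, 12) - Fraction(nu2, 4) - Fraction(nu3, 3) - Fraction(cusps, 2)
    assert g.denominator == 1
    return int(g)


if __name__ == "__main__":
    print([(N, genus_X0(N)) for N in range(1, 41)])
```

In particular this gives genus $0$ for $N = 1$, matching the example above, and genus $1$ for $N = 11$, the smallest level at which a weight $2$ cusp form exists.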
So even if we restrict ourselves to forms of weight $2$ on $\Gamma_0(N)$, there are a couple of hundred years of mathematics to be learned. Many, many things to say.
In fact, one can prove that the modular curve $X_0(N)$ admits a smooth model over $\mathbf Z[1/N]$ (and in particular over $\mathbf Q$). This should be a very surprising fact - indeed, the curve $X_0(N)$ is initially defined over $\mathbf C$, i.e. it's a Riemann surface, and there is no reason a priori why it should admit a model over $\mathbf Q$ (in other words, the functor from smooth curves over $\mathbf Q$ to Riemann surfaces is neither full nor essentially surjective). This extra God-given arithmetic data is what makes these objects so rich and fascinating (peace be upon his Noodly Appendage).
Anyways - that's just a little bit of what we can say. If you'd like a good read, I recommend Rational Points on Modular Elliptic Curves by Henri Darmon. I'm reading through it myself and I share your fascination!
Solution 3:
Some brief comments on your three questions.
Why care about all modular forms and not just, say, cusp forms? Well, why in real analysis do we use all real numbers if, in practice, so few of them are really of direct interest (not so many classical transcendental constants come up, just $\pi$, $e$, Euler's constant, $\Gamma(1/4)$,...)? The answer is that we need the whole real line to get a lot of the machinery of real analysis to work. Likewise, even if ultimately you may care more about cusp forms than other modular forms, the cusp forms are pretty subtle objects and it is usually easier to write down examples of modular forms first that may not necessarily be cusp forms. Differences of such modular forms may then turn out to be cusp forms, so to the extent that it's subtle to write down cusp forms directly, you can still get at some examples of them using other modular forms.
See Serre's article "Modular forms of weight one and Galois representations" in the 1977 Durham conference proceedings Algebraic Number Fields (edited by Froehlich, but note this is not Cassels & Froehlich).
One of the great theorems about elliptic curves over $\mathbf Q$ is Mazur's theorem classifying all possible torsion subgroups in $E({\mathbf Q})$. The theorem amounts to a description of the rational points on certain modular curves, which is obtained using the Jacobians of these curves. An understanding of the geometry of these objects is a prerequisite for understanding the rational points on them. (The later work of Kamienny and Merel on uniform boundedness of torsion on elliptic curves over general number fields was an extension of Mazur's approach.)