What's an intuitive explanation for integration?

I have never taken a formal calculus course, but I know some of the basics, like differentiation and limits. I'm currently reading a book that is mathematically intensive, and I come across integration notation quite often, but I lack an intuitive understanding that would let me apply it to the physical world, or in this case to understand what is going on. I have read online explanations, and it seems that they discuss finding the area under the curve but lack a fundamental explanation. I would appreciate it if someone could explain integration to me so that I can gain an intuitive understanding of it and apply it to physical concepts.


"Area under the curve" only works if you have been trained (or trained yourself) to associate the area under some curve with the phenomenon you're trying to understand. This works fine as a visualization of the mathematics of an electromagnetic effect if you already understand that effect mathematically. That does seem a bit backwards.

In other words, I think you have an excellent question.

You have gotten at least one good answer. The following is (or should be, if I explain it correctly) completely consistent with the other answer(s), just from a slightly different perspective.

Consider a physical property such as the charge in a capacitor. We start with a capacitor with a certain amount of charge on each side and discharge it slowly through some device. While it's discharging, we keep track of the remaining charge on one side; let's say we choose the side with a negative charge. As you may imagine, we can consider the charge to be a function of time, which we might name $Q(t).$ The derivative of that function with respect to time, $\frac d{dt} Q(t),$ is the rate at which the charge changes (positive in this case, since charge is getting less negative), that is, the current, $I(t),$ flowing into that side of the capacitor at each instant in time.

Now, the integral of a function is often called an antiderivative. If you can understand how a physical quantity (for example, $Q(t)$) relates to its derivative (for example, $I(t)$), then you can work that relationship in reverse. Instead of measuring $Q(t)$ as a function of time and computing $I(t)$ from it, measure $I(t)$ as a function of time and compute $Q(t)$ from it. If you know the charge at all times, you can reconstruct the measurement of current; if you know the measurement of the current at all times, you can (in a sense) reconstruct the charge. More precisely speaking, you can reconstruct how much charge was added or removed since the instant when you started measuring the current.
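To make this concrete, here is a minimal Python sketch of that reconstruction (the current trace, the time step, and all the numbers are invented for illustration): given samples of $I(t)$, it accumulates $I(t)\,\Delta t$ to estimate how much charge has been added to the plate since the measurement started, and compares with the exact antiderivative.

```python
import numpy as np

# Invented example data: current I(t) (in amperes) sampled every dt seconds.
dt = 0.001                       # sampling step in seconds
t = np.arange(0.0, 5.0, dt)      # measurement times
I = 2.0 * np.exp(-t / 1.5)       # "measured" current flowing into the plate

# Accumulate the samples: Q(t_k) - Q(0) ≈ sum of I(t_j) * dt for j <= k.
delta_Q = np.cumsum(I) * dt

print(f"charge added after 5 s ≈ {delta_Q[-1]:.4f} C")
# For comparison, the exact antiderivative of 2 e^(-t/1.5) is -3 e^(-t/1.5),
# so the exact charge added on [0, 5] is 3 (1 - e^(-5/1.5)).
print(f"exact value            = {3.0 * (1.0 - np.exp(-5.0 / 1.5)):.4f} C")
```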

Sometimes you are measuring a quantity as a function of something other than time, but the principle is the same; all that changes is the variable or dimension over which you take your derivative.

So the name "antiderivative" is very apt: given a function, $f,$ that is a derivative of some physical quantity, we take the antiderivative (integrate $f$) to reconstruct (at least partly) the original physical quantity, which is itself a function, $F.$

Sometimes integration is done over an area or volume. Again, this is just reversing a differential/derivative, but now the derivative is defined over multiple dimensions instead of just one, and it is often thought of as some kind of density.

And we should expect that the tools of calculus will mostly apply relatively directly to physical processes, since that's a very large part of the reason they were first developed.


The integral $\newcommand{\dx}{\,\mathrm{d}x}\newcommand{\mm}{\mathrm{m}}\newcommand{\ss}{\mathrm{sec}}$ \begin{equation*} \int\limits_a^b f(x) \dx \end{equation*} represents how much of something is being accumulated as you sweep $f(x)$ along the $x$-axis from $a$ to $b$. Accumulation is a great word here. Let's illustrate with some examples:

  • Area under the curve — Considering the graph of $f$, you can always think of $f(x)$ as the distance between the $x$-axis and the point $(x,f(x))$. As $x$ moves from $a$ to $b$ along its axis, these heights $f(x)$ are sweeping out a region, and $\int_a^b f(x) \dx$ counts how much area you're accumulating in that region while $x$ moves from $a$ to $b$. Note that the units make sense here: if $x$ and $f(x)$ are measured in meters $\mm$, then the units of $\dx$ must be $\mm$, and so the units of $f(x) \dx$ must be $\mm^2$. Taking the integral $\int$ you're just summing up a bunch of things with units $\mm^2$, so the integral itself will have units $\mm^2$, signifying that it's an area.

  • Velocity accumulating distance over time — If your function $v(t)$ models the velocity of some object at time $t$, then $\int_a^b v(t) \,\mathrm{d}t$ counts how much distance the object has accumulated as $t$ sweeps from $t=a$ to $t=b$. Note again that the units make sense: the units of $v(t)$ are $\mm/\ss$ and the units on time $t$ are $\ss$, so the units on $\int_a^b v(t) \,\mathrm{d}t$ must be $\mm$.

  • Accumulating charge over time by integrating the current (David's answer)

  • Cross-sectional areas sweeping out a region along an axis — Suppose you have some region $R$ that lives along some axis $z$ in space, and suppose the function $A(z)$ returns the area of a cross-section (slice) of $R$ at that particular value of $z$. As $z$ sweeps along the axis from $a$ to $b$, $\int_a^b A(z) \,\mathrm{d}z$ counts how much volume we have accumulated inside $R$. So if $A(z)$ is accurate along the entire $z$-axis, $\int_{-\infty}^{\infty} A(z) \,\mathrm{d}z$ is the volume of $R$. Measuring everything in meters, note the units of $A(z)\,\mathrm{d}z$ are $(\mm^2)(\mm) = \mm^3$, the units of a volume.

  • Accumulating mass by integrating density — Suppose you have a region $R$ in three-dimensional space and you want to think of that region as having mass, with a function $\delta$ that returns the mass density at each point of the region. Then the integral $\iiint_R \delta \,\mathrm{d}R$ counts how much total mass the region has.

Note that there's no mention of calculus here at all. All of these examples explain only what the notation $\int_a^b f(x)\dx$ means, but not how to compute $\int_a^b f(x)\dx$. The computation of the values of these integrals happens to involve the anti-derivative of $f$, and that is the true magic you get from the fundamental theorem of calculus.
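To see that sweeping/accumulating picture in numbers, here is a minimal Python sketch of the velocity example from the list above (the profile $v(t)=3t^2$ is an arbitrary illustrative choice): it accumulates $v(t)\,\mathrm{d}t$ over smaller and smaller slices of time, and the totals creep toward the value that the antiderivative, via the fundamental theorem, hands you directly.

```python
import numpy as np

def accumulate(f, a, b, n):
    """Left Riemann sum: sweep f along the axis from a to b in n small steps,
    accumulating f(x) * dx at each step."""
    x = np.linspace(a, b, n, endpoint=False)
    dx = (b - a) / n
    return np.sum(f(x)) * dx

# Illustrative velocity profile v(t) = 3 t^2 (m/s); distance accumulated on [0, 2] s.
v = lambda t: 3 * t**2

for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: accumulated distance ≈ {accumulate(v, 0, 2, n):.5f} m")

# The antiderivative of 3 t^2 is t^3, so the fundamental theorem of calculus
# gives the exact total: 2^3 - 0^3 = 8 m, which the sums approach.
```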


The integral is basically the total change found from the rate of change.

The way I like to see it is this: when I look at a package of printing paper, say 500 sheets stacked on top of each other, it reminds me of the volume of the package found by the slice method, namely that the integral of the cross-sectional area is the volume.

It is related to the area under the curve where the curve is positive.

For example, if you have the rate of money flow, you can integrate it to find the total money flow.

If you have a formula for the speed, you can integrate to find the total distance traveled, or the arc-length.
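Here is a minimal Python sketch of the slice method mentioned above (the shape, a unit sphere, is an arbitrary choice): stacking thin cross-sectional slices, like sheets in a package of paper, accumulates the volume.

```python
import numpy as np

# Slice method: for a sphere of radius 1, the slice at height z has
# cross-sectional area A(z) = pi * (1 - z^2) for -1 <= z <= 1.
A = lambda z: np.pi * (1 - z**2)

n = 100_000                            # number of thin slices
z = np.linspace(-1, 1, n, endpoint=False)
dz = 2 / n
volume = np.sum(A(z)) * dz             # accumulate area * thickness

print(f"sliced volume ≈ {volume:.6f}")
print(f"exact 4*pi/3  = {4 * np.pi / 3:.6f}")
```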


In Probability

Integration in probability is often interpreted as "the expected value". To build up our intuition for why that is, let us start with sums.

Starting Small

Let's say you play a game of dice where you win 2€ if you roll a 6 and lose 1€ if you roll any other number. Then we want to calculate what you should expect to receive "on average". Most people find the practice of multiplying each payoff by its probability and summing over the outcomes relatively straightforward. In this case you get

$$\text{Expected Payoff} = \frac{1}{6} 2€ + \frac{5}{6}(-1€) = -0.5€$$

Now let us try to formalize this and think about what is happening here. We have a set of possible outcomes $\Omega=\{1,2,3,4,5,6\}$ where each outcome is equally likely. And we have a mapping $Y:\Omega \to \mathbb{R}$ which denotes the payoff. I.e.

$$ Y(\omega) = \begin{cases} 2 & \omega = 6,\\ -1 & \text{else} \end{cases} $$ And then the expected payoff is $$ \mathbb{E}[Y] = \frac{1}{|\Omega|}\sum_{\omega\in\Omega} Y(\omega) = \frac{1}{6}(2 + (-1) + ... + (-1)) = -0.5 $$ where $|\Omega|$ is the number of elements contained in $\Omega$.
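A tiny Python sketch of this computation (the game and the probabilities are exactly those above; the simulation size and seed are arbitrary): it evaluates the weighted sum and, as a sanity check, averages a large number of simulated rolls.

```python
import random

# Exact expected payoff: weight each outcome's payoff by its probability and sum.
omega = [1, 2, 3, 4, 5, 6]
Y = lambda w: 2 if w == 6 else -1
expected = sum(Y(w) for w in omega) / len(omega)
print(f"exact expectation     = {expected:+.2f} €")

# Sanity check by simulation: the long-run average payoff over many rolls.
random.seed(0)
rolls = [Y(random.randint(1, 6)) for _ in range(1_000_000)]
print(f"average of 10^6 rolls ≈ {sum(rolls) / len(rolls):+.2f} €")
```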

Introducing Infinity

Now this works fine for finite $\Omega$, but what if the set of possible outcomes is infinite? What if every real number in $[0,1]$ were possible and equally likely, and the payoff looked like this?

$$ Y: \begin{cases} [0,1] \to \mathbb{R} \\ \omega \mapsto \begin{cases} 2 & \omega > \frac{5}{6} \\ -1 & \omega \le \frac{5}{6} \end{cases} \end{cases} $$

Intuitively this payoff should have the same expected payoff as the previous one. But if we simply try to do the same thing as previously...

$$ \mathbb{E}[Y] = \frac{1}{|\Omega|}\sum_{\omega\in\Omega} Y(\omega) = \frac{1}\infty (\infty - \infty)... $$

Okay so we have to be a bit more clever about this. If we have a look at a plot of your payoff $Y$,

[Plot of the payoff $Y$: the value $-1$ on $[0,\frac56]$, jumping to $2$ on $(\frac56,1]$]

we might notice that the area under the curve is exactly what we want.

$$ -1€\left(\frac56-\frac06\right) + 2€ \left(\frac66 - \frac56 \right) = -0.5€ $$
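If you want to verify this numerically, here is a short Python sketch (the sample size and seed are arbitrary): it computes the area under $Y$ as a fine Riemann sum and, independently, the long-run average payoff of uniform draws from $[0,1]$; both land at $-0.5$€.

```python
import numpy as np

# Payoff on the continuous outcome space [0, 1]: 2 € above 5/6, -1 € at or below.
Y = lambda w: np.where(w > 5/6, 2.0, -1.0)

# "Area under the curve": a fine Riemann sum over [0, 1] ...
w = np.linspace(0, 1, 1_000_000, endpoint=False)
print(f"area under Y             ≈ {Y(w).mean():.3f} €")   # sum(Y(w)) * dw with dw = 1/n

# ... and the same number obtained as the average payoff of uniform draws.
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=1_000_000)
print(f"average simulated payoff ≈ {Y(samples).mean():.3f} €")
```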

Now why is this the same? How are our sums related to an area under a curve?

Summing to one

To understand this it might be useful to consider what the expected value of a simpler function is

$$ \mathbf{1}: \begin{cases} \Omega \to \mathbb{R}\\ \omega \mapsto 1 \end{cases} $$

In our first example this was

$$ \frac{1}{|\Omega|} \sum_{\omega\in\Omega} \mathbf{1}(\omega) = \frac{|\Omega|}{|\Omega|} = 1 $$

In our second example this would be

$$ \int_{\Omega} 1 d\omega = \int_0^1 1 d\omega = 1 $$

Now if we recall how the integral (area under the curve) is calculated, we might notice that in the case of indicator functions we are weighting the height of the indicator function with the size of the interval. And the size of the interval is its length.

Similarly we could move $\frac{1}{|\Omega|}$ into the sum and view it as the weighting of each $\omega$. And here is where we have the crucial difference:

In the first case, individual $\omega$ have a weight (a probability), while individual points in an interval have no length/weight/probability. But while countable sets of individual points have no length, an uncountable union of points with no length/probability can have positive length/probability.

This is why probability is closely intertwined with measure theory, where a measure is a function assigning sets (e.g. intervals) a weight (e.g. length, or probability).

Doing it properly

So if we restart our attempt at defining the expected value, we start with a probability space $\Omega$ and a probability measure $P$ which assigns subsets of $\Omega$ a probability. A real valued random variable (e.g. payoff) $Y$ is a function from $\Omega$ to $\mathbb{R}$. And if it only takes a finite number of values in $\mathbb{R}$ (i.e. $Y(\Omega)\subseteq \mathbb{R}$ is finite), then we can calculate the expected value by going through these values, weighting them by the probability of their preimages and summing them.

$$ \mathbb{E}[Y] = \sum_{y\in Y(\Omega)} y P[Y^{-1}(\{y\})] $$ To make notation more readable we can define $$ \begin{aligned} P[Y\in A] &:= P[Y^{-1}(A)] \qquad\text{and} \\ P[Y=y]&:=P[Y\in\{y\}] \end{aligned} $$

In our finite example the expected value is

$$ \begin{aligned} \mathbb{E}[Y] &= 2 P(Y=2) + (-1) P(Y=-1)\\ &=2 P(Y^{-1}[\{2\}]) +(-1)P(\{1,2,3,4,5\})\\ &= 2 \cdot \frac16 - 1\cdot \frac56 = -0.5 \end{aligned} $$

In our infinite example the expected value is

$$ \begin{aligned} \mathbb{E}[Y] &= 2P(Y=2) + (-1)P(Y=-1)\\ &= 2P\left(\left(\frac56, 1\right]\right) - P\left(\left[0, \frac56\right]\right) = \int_0^1 Y d\omega\\ &= 2\cdot \frac16 - \frac56 = -0.5 \end{aligned} $$

Now it turns out that you can approximate every $Y$ with infinite image $Y(\Omega)$ by a sequence of mappings $Y_n$ with finite image, and that the limit

$$ \int_\Omega Y dP := \lim_n \int_\Omega Y_n dP, \qquad \int_\Omega Y_n dP := \sum_{y\in Y_n(\Omega)} y P(Y_n=y) $$

is well defined and independent of the sequence $Y_n$.
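Here is a small Python sketch of that finite-image approximation (the choice $\Omega=[0,1]$ with the uniform probability measure and $Y(\omega)=\omega^2$ is mine, purely for illustration; it is not the payoff from the examples above): $Y_n$ rounds $Y$ down to the nearest multiple of $1/2^n$, so it has finite image, and $\mathbb{E}[Y_n]=\sum_y y\,P(Y_n=y)$ is a finite weighted sum that increases toward $\mathbb{E}[Y]=\frac13$.

```python
import numpy as np

# Illustration only: Ω = [0, 1] with the uniform probability measure P,
# and Y(ω) = ω², which takes infinitely many values.
# Y_n = floor(2^n · Y) / 2^n has finite image {0, 1/2^n, ..., (2^n - 1)/2^n}.
def expected_Yn(n):
    total = 0.0
    for k in range(2**n):
        y = k / 2**n
        # Y_n = y exactly when y <= ω² < y + 1/2^n, i.e. ω in [sqrt(y), sqrt(y + 1/2^n)),
        # and under the uniform measure that event's probability is its length.
        prob = np.sqrt(y + 1 / 2**n) - np.sqrt(y)
        total += y * prob
    return total

for n in (1, 2, 4, 8, 12):
    print(f"n = {n:>2}: E[Y_n] ≈ {expected_Yn(n):.6f}")
print(f"limit (exact E[Y])   = {1/3:.6f}")
```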

Lebesgue Integral

The integral we defined above is called the Lebesgue integral. The neat thing about it is that

  1. Riemann integration is a special case of it, if we integrate with respect to the Lebesgue measure $\lambda$, which assigns intervals $[a,b]$ their length $\lambda([a,b])=b-a$: $$\int_{[a,b]} f d\lambda = \int_a^b f(x) dx$$
  2. Sums and series are also a special case using sequences $(a(n), n\in\mathbb{N})$ and a "counting measure" $\mu$ on $\mathbb{N}$ which assigns a set $A$ its size $\mu(A) = |A|$. Then

$$ \int_{\mathbb{N}} a d\mu = \sum_{n\in\mathbb{N}} a(n) $$
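A quick numerical illustration of both points (the particular function $x^2$ and the sequence $a(n)=2^{-n}$ are arbitrary choices):

```python
import numpy as np

# Point 1: for a nice function, the Lebesgue integral against the Lebesgue
# measure on [0, 1] agrees with the ordinary Riemann integral.
f = lambda x: x**2
x = np.linspace(0.0, 1.0, 1_000_000, endpoint=False)
print(f"integral of x^2 over [0,1] ≈ {f(x).mean():.6f}  (exact: 1/3)")

# Point 2: with the counting measure on the natural numbers, "integrating"
# a sequence is just summing it; here a(n) = 1/2^n for n = 1, 2, 3, ...
a = lambda n: 0.5**n
print(f"sum of a(n) over n >= 1    ≈ {sum(a(n) for n in range(1, 60)):.6f}  (exact: 1)")
```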

One implication is that you can often treat integration and summation interchangeably. Proving statements for Lebesgue integrals is rarely harder than proving them for Riemann integrals, and in the former case all results also apply to series and sums.

It also means we can properly deal with "mixed cases" where some individual points have positive probability and some points have zero probability on their own but sets of them have positive probability.

My stochastics professor likes to call integration just "infinite summation" because in some sense you are just summing over an infinite number of elements in a "proper way".

The Lebesgue integral also makes certain real functions integrable which are not Riemann integrable. The function $\mathbf{1}_{\mathbb{Q}}$ is not Riemann integrable, but poses no problem for Lebesgue integration. The reason is that Riemann integration subdivides the $x$-axis into intervals without consulting the function that is supposed to be integrated, while Lebesgue integration subdivides the $y$-axis and uses the preimage information about the function that is supposed to be integrated.
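Here is a small Python sketch of why the Riemann approach breaks down for $\mathbf{1}_{\mathbb{Q}}$ (a deliberately crude illustration; a type check stands in for "is rational"): Riemann sums evaluate the function at sample points chosen without consulting it, so rational sample points give $1$ and irrational sample points give $0$, no matter how fine the partition, and the sums never settle. The Lebesgue integral only needs to know that $\mathbb{Q}\cap[0,1]$ has measure $0$, so it is simply $0$.

```python
from fractions import Fraction
import math

def indicator_Q(x):
    # Indicator of the rationals, for this illustration only: we only ever feed
    # in numbers we *know* to be rational (Fraction) or irrational (shifted by
    # a multiple of sqrt(2)), so a type check stands in for the real definition.
    return 1.0 if isinstance(x, Fraction) else 0.0

n = 1000
# Riemann-style sum with rational sample points k/n: every sample point is rational.
rational_tags = sum(indicator_Q(Fraction(k, n)) for k in range(n)) / n
# Riemann-style sum with irrational sample points k/n + sqrt(2)/(10 n).
irrational_tags = sum(indicator_Q(k / n + math.sqrt(2) / (10 * n)) for k in range(n)) / n

print(f"Riemann sum, rational sample points   = {rational_tags}")    # 1.0
print(f"Riemann sum, irrational sample points = {irrational_tags}")  # 0.0
# Refining the partition never reconciles the two, so the Riemann integral
# does not exist; the Lebesgue integral of 1_Q over [0, 1] is simply 0.
```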

Back to Intuition

Now the end result might not resemble our intuition about "expected values" anymore. We get some of it back with theorems like the law of large numbers, which shows that averages

$$\frac{1}{n} \sum_{k=1}^n X_k$$

of independent, identically distributed random variables converge (in various senses) to the theoretically defined expected value $\mathbb{E}[X]$.
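A short simulation of that convergence, using the dice payoff from the first example (so $\mathbb{E}[X]=-0.5$€; sample sizes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# X_k: payoff of the dice game from the first example, with E[X] = -0.5 €.
def sample_payoffs(n):
    faces = rng.integers(1, 7, size=n)          # fair die rolls 1..6
    return np.where(faces == 6, 2.0, -1.0)      # payoff per roll

for n in (10, 1_000, 100_000, 1_000_000):
    print(f"average of n = {n:>9} payoffs: {sample_payoffs(n).mean():+.4f} €")
# As n grows, the averages settle near the expected value -0.5 €.
```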

A note on Random Variables

In our examples above, only the payoff $Y$ was a random variable (a function from the probability space $\Omega$ to $\mathbb{R}$). But since we can compose functions by chaining them, nothing would have stopped us from defining the possible die faces as a random variable on some unknown probability space $\Omega$. Since our payoff is just a function of the die faces, their composition would also be a function from $\Omega$. And it is often convenient not to define $\Omega$ and to start with random variables right away, as it allows easy extensions of our models without having to redefine our probability space, because we treat the underlying probability space as unknown anyway and only work with known windows (random variables) into it. Notice how you could not discern the die faces $\{1,...,5\}$ from the payoff $Y=-1$ alone. So random variables can also be viewed as information filters.

Lies

While we would like our measures to assign every subset of $\Omega$ a number, this is generally not possible without sacrificing their usefulness.

If we wanted a measure on $\mathbb{R}$ which fulfills the following properties

  1. translation invariance (moving a set about does not change its size)
  2. countable additivity (the measure of a countable union of disjoint sets is the sum of their measures)
  3. non-negativity
  4. finiteness on every bounded set

we are only left with the $0$ measure (assigning every set measure 0).

Proof sketch: use the axiom of choice to select a representative of every equivalence class of the equivalence relation $x-y\in \mathbb{Q}$ on the set $[0,1]$. This set of representatives is not measurable, because translating it by rational numbers modulo 1 transforms it into other, disjoint sets of representatives of the same equivalence relation. Since these translates are disjoint and countable and together cover the interval, we can sum their measures to get the measure of the entire interval $[0,1]$. But a countably infinite sum of equal terms cannot be finite unless the terms are all $0$. Therefore the set $[0,1]$ must have measure $0$, and by translation and summation so does every other set in $\mathbb{R}$.

For this reason we have to restrict ourselves to a set of "measurable sets" (a $\sigma$-algebra) which is only a subset of the powerset $\mathcal{P}(\Omega)$ of $\Omega$. This conundrum also limits the functions we can integrate with Lebesgue integration to the set of "measurable functions".

But all of these things are technicalities distracting from the intuition.


Suppose you have some process that, over time, adds to (or subtracts from) a physical quantity. For example, a moving object has velocity, and over time that velocity results in a change in position. Or, if you apply a [variable] force to an object through a distance, over time you do more and more work to move the object.

In simple cases, one can compute the total amount of [quantity] by multiplying: if the velocity is constant, then $d = rt$. If the force is constant, then $W = Fd$. But what if the process does not proceed at a constant rate? That is one of the problems that integration addresses. So if your force changes over time, you can imagine approximating the force over some small interval of time by a constant (since presumably over a small interval, the force doesn't change much). Do that over the entire range of times you're concerned with; over each such interval it's easy to approximate the work, since the force is roughly constant. So the total work is approximately the sum of each of those easy-to-calculate pieces. Now let the small intervals get smaller and smaller. At least if the force function is reasonable, the sum of those pieces of work gets closer and closer to the total work performed. That total work is the integral of the force over time, and the areas that successively approximate it are various Riemann sums on that time interval.
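As a rough numerical sketch of that limiting process (the spring force $F(x)=kx$ and all the numbers are invented for illustration): treat the force as constant on each small piece, add up the easy pieces, and watch the total approach the exact value $\tfrac12 k L^2$ given by the integral.

```python
import numpy as np

# Invented example: stretching a spring whose force at stretch x is F(x) = k x.
k = 200.0    # spring constant in N/m
L = 0.3      # total stretch in m

def approx_work(n):
    """Pretend the force is constant on each of n small sub-intervals of [0, L]
    and add up the easy pieces F(x_i) * dx (a Riemann sum)."""
    x = np.linspace(0.0, L, n, endpoint=False)
    dx = L / n
    return np.sum(k * x) * dx

for n in (2, 10, 100, 10_000):
    print(f"{n:>6} pieces: W ≈ {approx_work(n):.5f} J")

# The integral of F(x) = k x from 0 to L is (1/2) k L², the exact work.
print(f"integral (exact): W = {0.5 * k * L**2:.5f} J")
```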