Is Bayes' Theorem really that interesting?
You are mistaken in thinking that what you perceive as "the massive importance that is afforded to Bayes' theorem in undergraduate courses in probability and popular science" is really importance afforded to Bayes' theorem itself. It is, rather, importance afforded to the Bayesian interpretation of probability. But it's probably not your fault: this usually doesn't get explained very well.
What is the probability of a Caucasian American having brown eyes? What does that question mean? By one interpretation, commonly called the frequentist interpretation of probability, it asks merely for the proportion of persons having brown eyes among Caucasian Americans.
What is the probability that there was life on Mars two billion years ago? What does that question mean? It has no answer according to the frequentist interpretation. "The probability of life on Mars two billion years ago is $0.54$" is taken to be meaningless because one cannot say it happened in $54\%$ of all instances. But the Bayesian, as opposed to frequentist, interpretation of probability works with this sort of thing.
The Bayesian interpretation, applied to statistical inference, is immune to various pathologies afflicting frequentist inference.
Possibly you have seen that some people attach massive importance to the Bayesian interpretation of probability and mistakenly thought it was merely massive importance attached to Bayes's theorem. People who do consider Bayesianism important seldom explain this very clearly, primarily because that sort of exposition is not what they care about.
While I agree with Michael Hardy's answer, there is a sense in which Bayes' theorem is more important than any random identity in basic probability. Write Bayes' Theorem as
$$\text{P(Hypothesis|Data)}=\frac{\text{P(Data|Hypothesis)P(Hypothesis)}}{\text{P(Data)}}$$
The left hand side is what we usually want to know: given what we've observed, what should our beliefs about the world be? But the main thing that probability theory gives us is in the numerator on the right side: the frequency with which any given hypothesis will generate particular kinds of data. Probabilistic models in some sense answer the wrong question, and Bayes' theorem tells us how to combine this with our prior knowledge to generate the answer to the right question.
Frequentist methods that try not to use the prior have to reason about the quantity on the left by indirect means or else claim the left side is meaningless in many applications. They work, but frequently confuse even professional scientists. E.g. the common misconceptions about $p$-values come from people assuming that they are a left-side quantity when they are a right-side quantity.
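To make the left-side/right-side distinction concrete, here is a toy sketch in Python. The setup is made up for illustration: the specific biased alternative $p = 0.7$ and the 50/50 prior are assumptions, not anything from the discussion above. It computes a one-sided $p$-value for 15 heads in 20 flips (a right-side quantity) alongside the posterior probability that the coin is fair (a left-side quantity); the two numbers differ noticeably.

```python
from math import comb

n, heads = 20, 15  # made-up data: 15 heads in 20 flips

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n flips of a p-coin."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Right-side quantity (a one-sided p-value): how often a *fair* coin
# produces data at least this surprising, i.e. 15 or more heads.
p_value = sum(binom_pmf(k, n, 0.5) for k in range(heads, n + 1))

# Left-side quantity: P(fair | data), assuming a 50/50 prior between
# "fair" (p = 0.5) and one specific biased alternative (p = 0.7).
like_fair = binom_pmf(heads, n, 0.5)
like_biased = binom_pmf(heads, n, 0.7)
post_fair = like_fair * 0.5 / (like_fair * 0.5 + like_biased * 0.5)

print(f"P(>= {heads} heads | fair) = {p_value:.3f}")   # right-side: ~0.021
print(f"P(fair | {heads} heads)   = {post_fair:.3f}")  # left-side:  ~0.076
```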
You might know only $\Pr[A\mid B]$ and not $\Pr[B\mid A]$, not because someone "adversarially told you the wrong one", but because one of those is a natural quantity to compute, and the other is a natural quantity to want to know.
I am about to teach Bayes' theorem in an undergraduate course in probability. The general setting I want to consider is when:
- We have several competing hypotheses about the world. (Several candidates for $B$.)
- If we assume one of these hypotheses, then we get a nice and easy probability problem where it's easy to find the probability of $A$: some observations that we've made. (Outside undergraduate probability courses, "nice and easy" is a relative term.)
- We want to figure out which hypothesis is likelier.
The mammogram example is natural, but less obviously so, because we have to track down where the numbers given to us come from, and ask why we couldn't have been given the other quantities in the problem instead. So here are some examples where fewer numbers come to us out of thin air.
- Suppose you are communicating over a binary channel which flips bits $10\%$ of the time. (This number is given to us out of nowhere, but it's the natural quantity to ask about first.) Your friend has several possible messages they might send you: these are the hypotheses $B_1, B_2, \dots, B_n$. You receive a message: that's the observation $A$. Then $\Pr[A \mid B_i]$ is just $(0.1)^k (0.9)^{m-k}$ if $B_i$ is an $m$-bit message that differs from the one you received in $k$ places. On the other hand, $\Pr[B_i \mid A]$ is the quantity we want: it will tell us how likely it is that your friend sent each message. (See the first sketch after this list.)
- You have a coin, and you don't know anything about its fairness. One possible assumption is that it lands heads with probability $p$, where $p \sim \text{Uniform}(0,1)$, but we could vary this. Then you flip the coin $n$ times and see $k$ heads. There are infinitely many hypotheses $B_p$, one for each possible $p$; under each of them, $\Pr[A \mid B_p]$ is just a binomial probability. What Bayes' theorem gives us is the conditional PDF of $p$, which tells us how likely the coin is to land heads in the future. (See the second sketch after this list.)
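Here is a minimal sketch of the channel example in Python. The candidate messages, the received word, and the uniform prior over candidates are all made-up illustrative assumptions:

```python
def likelihood(sent: str, received: str, flip_p: float = 0.1) -> float:
    """P(received | sent) over a channel flipping each bit independently."""
    k = sum(s != r for s, r in zip(sent, received))  # bits that differ
    return flip_p**k * (1 - flip_p) ** (len(sent) - k)

candidates = ["0000", "1111", "1010"]  # hypotheses B_1, ..., B_n (made up)
received = "1011"                      # the observation A (made up)

# Bayes' theorem with a uniform prior over the candidate messages;
# P(A) is just the normalizing sum over all hypotheses.
prior = 1 / len(candidates)
joint = {b: likelihood(b, received) * prior for b in candidates}
evidence = sum(joint.values())
posterior = {b: j / evidence for b, j in joint.items()}

for b, p in posterior.items():
    print(f"P(sent {b} | received {received}) = {p:.3f}")
```

With these values, the two candidates that are one bit flip away from the received word share almost all of the posterior mass.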
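And a sketch of the coin example: under the uniform prior, the posterior density is proportional to $p^k(1-p)^{n-k}$, which is the kernel of a $\text{Beta}(k+1,\, n-k+1)$ distribution, so we can read the posterior off directly. The data below are made up, and this assumes SciPy is available:

```python
# Beta(k+1, n-k+1) is the posterior for p under a Uniform(0,1) prior.
from scipy.stats import beta

n, k = 20, 14  # made-up data: 14 heads in 20 flips
posterior = beta(k + 1, n - k + 1)

print("posterior mean of p:", posterior.mean())           # (k+1)/(n+2) ~ 0.68
print("95% credible interval:", posterior.interval(0.95))
```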
There are two main issues here. One is that under a Bayesian interpretation of probability (the term refers to the interpretation, not the theorem, though both are named for Bayes), probability quantifies how confident we are in individual events, even when detailed frequency statistics are unavailable. The best-of-both-worlds hope, if you combine the Bayesian and frequentist perspectives, is that past data supply the mammogram values you cited, and an individual woman can then be diagnosed using Bayes's theorem.
The second issue is that $P(A|B)$ need not be remotely close to $P(B|A)$. To wit:
- A test that's usually right may still have most of its positives be false, which warrants some scepticism, as well as further testing. (See the numeric sketch after this list.)
- Conflating $P(A|B)$ with $P(B|A)$ is a danger in the legal system. Will we arrest people based on a test's accuracy, precision etc., even if their guilt is unlikely? Will "this evidence is unlikely if they're innocent" get them convicted (the so-called prosecutor's fallacy), even though it may not mean their innocence is unlikely? And yes, this has had real-world fallout in both policing and court decisions.
- Statistics tests what probability assumes (e.g. "if this is Gaussian then..."). Statistical tests often boil down to, "we can't measure the probability the null hypothesis is true, but we'll assess it based on the probability, under the null hypothesis, that data at least this surprising would occur". Indeed, which statement gets to be the null hypothesis is more about facilitating such calculations than about its being a "default" or "reasonable" assumption.
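To put numbers on the first bullet, here is a minimal sketch with made-up values: a test that is $99\%$ accurate in both directions, applied to a condition affecting 1 in 1000 people.

```python
# Made-up numbers: a 99%-accurate test for a condition affecting 1 in 1000.
base_rate = 0.001     # P(B): person has the condition
sensitivity = 0.99    # P(A | B): positive test given the condition
specificity = 0.99    # P(not A | not B): negative test given no condition

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A)
p_positive = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
p_condition_given_positive = sensitivity * base_rate / p_positive

print(f"P(positive test)        = {p_positive:.4f}")
print(f"P(condition | positive) = {p_condition_given_positive:.3f}")  # ~0.09
```

Even at $99\%$ accuracy, roughly $91\%$ of the positives are false here, because people without the condition vastly outnumber those with it.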
Let me start with a memory. From my undergraduate days, 30 years ago, I vividly remember the time when Bayes was introduced. We had spent a lot of time and effort on sampling theory and on how to know whether things could be proved. And to me, at the time, it always ended up that we needed a sample size of x (as I remember it, a sample size of 7 was often the minimum).
To me, Bayes represented a totally different approach, one more in alignment with my view of reality. In sampling we looked at groups; with Bayes we started with individual things. So for me this was a very eye-opening addition to the field of probability praxis (and theory of course, but that came later for me). The book we had, written by Raiffa I believe, was about decision theory. 30 years later I still remember the discussion about whether to do one more test drilling in the oil field.
So, just maybe, in your curriculum the importance placed on Bayes is there to show that statistics does have several different branches, not only sampling theory or how to present graphs as correctly as possible.