What is the theory behind rigorous hypothesis testing?

I understand that hypothesis testing is essentially a statistical form of proof by contradiction. In proof by contradiction, you assume P, then show that it leads to a result Q, which you know to be false. Since we assume logical consistency, Q and ~Q cannot both be true, therefore ~P.

In hypothesis testing, you start with a hypothesis H and you make an observation, O. If you find that P(O | H) << 1, then we can say that H is unlikely, in a similar way to proof by contradiction, but this reasoning is a heuristic one. What is the formal mechanism behind this result?

How do you get from P(O | H) << 1 to P(H | O) << 1?


Solution 1:

From someone who was trained as a frequentist a long time ago...

For a frequentist, $P(H)$ (and the other terms such as $P(O\mid H)$ and $P(H\mid O)$) don't even make sense. Either $H$ is true or $H$ is false. You can only talk about $P(O)$ assuming $H$, or possibly $P(O)$ assuming another hypothesis $H_1$ (the latter being about a different probability space, a different random variable etc.).

If (assuming $H$) $P(O)$ turns out small (e.g. $10^{-6}$), you "reject" $H$. However, the meaning of this rejection is:

  • Not that you "know" $H$ is false (you don't),
  • Not that $H$ is "probably" false with "probability" $1-10^{-6}$ (that does not make sense to a frequentist).

The meaning is only that you can now claim: either $H$ is false, or $H$ is true and you were extremely lucky with your observation (it was one in a million).
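To make the "one in a million" figure concrete, here is a minimal sketch of the kind of calculation a frequentist would do. The coin-flip setup and the name `binomial_tail` are assumptions chosen for illustration, not part of the argument above: $H$ says a coin is fair, and the observation is 20 heads in 20 flips.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p).

    Assuming H fixes the success probability p, this is the chance
    of seeing a result at least as extreme as k successes in n trials.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical experiment: H = "the coin is fair"; we observe 20 heads in 20 flips.
p_value = binomial_tail(n=20, k=20)
print(p_value)  # equals 0.5**20, roughly 9.5e-07
```

The number printed is a statement about the observation assuming $H$, not a probability that $H$ is true, which is exactly the distinction drawn above.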

Solution 2:

This is a good question!

My immediate reaction: Can't you just use Bayes' Law here?

That is, we have:

$$P(H|O) = \frac{P(O|H) \cdot P(H)}{P(O)}$$

So, while this shows that $P(H)$ and $P(O)$ are also involved, at the very least it shows that $P(O|H)$ and $P(H|O)$ are proportionally related: for fixed $P(H)$ and $P(O)$, the lower $P(O|H)$, the lower $P(H|O)$.

Indeed, if we find $P(O|H) \ll 1$, then assuming 'comparable' values of $P(H)$ and $P(O)$, it would make sense that $P(H|O) \ll 1$.
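As a rough numerical sketch of this point (the function name `bayes_posterior` and all the numbers are made up for illustration), one can plug values straight into Bayes' Law and see what 'comparable' does and does not buy you:

```python
def bayes_posterior(p_o_given_h: float, p_h: float, p_o: float) -> float:
    """Return P(H|O) = P(O|H) * P(H) / P(O)."""
    joint = p_o_given_h * p_h  # P(O and H), which can never exceed P(O)
    if not 0 < p_o <= 1 or joint > p_o:
        raise ValueError("need 0 < P(O) <= 1 and P(O|H) * P(H) <= P(O)")
    return joint / p_o


# Comparable P(H) and P(O): a tiny P(O|H) carries over to a tiny P(H|O).
comparable = bayes_posterior(p_o_given_h=1e-6, p_h=0.1, p_o=0.1)

# P(H) much larger than P(O): the same tiny P(O|H) leaves P(H|O) at one half.
lopsided = bayes_posterior(p_o_given_h=1e-6, p_h=0.5, p_o=1e-6)

print(comparable, lopsided)
```

The consistency check in the function is there because the three inputs are not free: $P(O)$ can never be smaller than $P(O|H) \cdot P(H)$.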

OK, but are $P(O)$ and $P(H)$ indeed 'comparable'?

Well, it can be pointed out that in practice we typically try to make $P(O)$ fairly low. That is, you don't want to run an experiment where a large range of hypotheses would all predict $O$; you want to make a strong prediction. For example, if we are coming up with hypotheses as to how strong a person I am, we are not going to run an experiment that tests whether I can lift a pencil (which, even if I can, tells us very little about my strength) but rather one that tests whether I can lift a refrigerator (which, if I can, would say a lot more).

Likewise the most useful theories are strong theories, i.e. ones for which $P(H)$ is low. A theory that says that I can lift objects up to $1$ pound is not a very useful theory.

And finally, the best experiments are crucial experiments, where each hypothesis is associated with its very own unique observation, making it plausible that $P(O)$ and $P(H)$ are indeed 'comparable'.

One more thought. How would it be possible that $P(O|H) \ll 1$ and yet we don't have $P(H|O) \ll 1$? By Bayes' Law this happens when $P(H)$ is high and $P(O)$ is low (or at least when $P(H)$ is much higher than $P(O)$). What would that be like? It would be a context where we run an experiment that makes a prediction far more specific than the hypothesis. And how could that be? Well, maybe the prediction relies on lots of factors that have nothing to do with the hypothesis. For example, the prediction may only follow with the help of all kinds of auxiliary hypotheses that are in fact unlikely to be true (or that at least significantly decrease the probability of the prediction being true). Or the prediction relies on one's ability to run a super-controlled experiment, which may be unlikely to be the case.

And yes, in those kinds of cases it does make sense to say that just because some prediction didn't come out true (i.e. we didn't observe $O$), that doesn't mean the hypothesis is unlikely, as the problem may well lie elsewhere. Here is a link with proof by contradiction as well: if $H$ (hypothesis true), $A$ (auxiliary hypothesis true), and $E$ (experiment perfectly executed) together lead to a contradiction, we can reject $H$, but we could also reject $A$ or $E$.

So that is maybe the lesson here, and why I believe your observation that $P(H|O)$ and $P(O|H)$ are not the same thing is an important distinction for practicing scientists to keep in mind!
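One way to spell out when this happens is to expand $P(O)$ with the law of total probability. Here $\neg H$ stands for "$H$ is false", a notation introduced only for this sketch:

$$P(O) = P(O|H) \cdot P(H) + P(O|\neg H) \cdot P(\neg H)$$

Substituting into Bayes' Law gives:

$$P(H|O) = \frac{P(O|H) \cdot P(H)}{P(O|H) \cdot P(H) + P(O|\neg H) \cdot P(\neg H)}$$

So $P(H|O)$ is small exactly when $P(O|\neg H) \cdot P(\neg H)$ is much larger than $P(O|H) \cdot P(H)$. If the observation is just as improbable when $H$ is false, a tiny $P(O|H)$ by itself says nothing against $H$.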

Solution 3:

It depends whether you ask a Bayesian or a frequentist. The Bayesian would essentially give the answer that @Bram28 gave, namely that P(O|H) and P(H|O) are related by Bayes' theorem.

Let me give the frequentist view instead. A frequentist would argue that P(H|O) does not make sense, because whether a hypothesis holds or not is not random. What does make sense is P(O|H) and P(O|K) (where K denotes the alternative hypothesis). Loosely speaking, if one of the two conditional probabilities (say, the latter) is much larger than the other, one decides for K. Of course, the formal mechanism (involving a level, p-values, etc.) is more involved, but the core idea remains the same: