Understanding the P-value
Solution 1:
When browsing the posts flaired Mathematics at r/askscience, I lighted upon the same question! I quote u/tadrinth, but reformat his comment a shade.
The p-value is the probability of getting data at least as extreme as the data you actually got, given that the null hypothesis is true. If you leave out any part of that definition, you are not talking about p-values and you will mess up in your reasoning (I used to teach intro statistics, and I would ding you points on your exam for it).
Null hypothesis testing is a probabilistic version of proof by contradiction. You assume something, you show that your data is very unlikely given that assumption, and that provides evidence that your assumption is false.
For example, you might assume that your drug is no better than placebo (is equally likely to cure patients). Then you administer the drug to a bunch of patients, and the placebo to a bunch more patients, and you compare how many patients were cured.
If substantially more of the patients who got the drug were cured, then you have evidence against the null hypothesis that the drug and the placebo are the same.
But, if roughly similar numbers of patients were cured, then you haven't shown anything at all; maybe your drug isn't better, or maybe you just didn't have enough patients to prove anything.
The bigger the difference between the two, the smaller the p-value, because a big difference is unlikely to arise by chance if the drug and placebo really cure the same fraction of people.
The more people you have in the study, the smaller the p-value for the same underlying difference, because who gets better has an element of chance: with few patients you might randomly have some placebo patients recover on their own, and larger samples average that randomness out.
If there's no element of chance, you don't really need p-values; the answer will just be obvious, since one thing is always better. P-values are for when you cannot tell by eye because there's too much noise and the difference is too small, or when you want to prove the difference to someone.
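To make the drug example concrete, here is a minimal sketch of how that comparison could be run as a permutation test in Python. The cure counts are invented purely for illustration, not taken from any real trial.

```python
# Hypothetical drug-vs-placebo trial analysed with a permutation test.
# The cure counts below are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

# 1 = cured, 0 = not cured
drug    = np.array([1] * 60 + [0] * 40)   # 60 of 100 cured on the drug
placebo = np.array([1] * 45 + [0] * 55)   # 45 of 100 cured on placebo

observed_diff = drug.mean() - placebo.mean()

# Under the null hypothesis the labels "drug" and "placebo" are meaningless,
# so we shuffle them and count how often chance alone produces a difference
# at least as large as the one we observed.
pooled = np.concatenate([drug, placebo])
n_drug = len(drug)
n_sims = 20_000
count = 0
for _ in range(n_sims):
    rng.shuffle(pooled)
    diff = pooled[:n_drug].mean() - pooled[n_drug:].mean()
    if diff >= observed_diff:
        count += 1

p_value = count / n_sims
print(f"observed difference: {observed_diff:.2f}, one-sided p-value: {p_value:.4f}")
```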
In practice, we want to make up our minds at some point, even though this whole thing is probabilistic. To do that, we say that if the p-value is below a certain threshold (this threshold is called alpha; the proposal under consideration is lowering alpha from 0.05 to 0.005), we reject the null hypothesis. If you designed your experiment well, then whenever the null hypothesis is false, some other hypothesis of interest must be true, and rejecting the null counts as support for that other hypothesis.
Lowering that threshold means demanding more evidence before we are willing to reject the null hypothesis. It essentially says how often we are willing to reject the null hypothesis when it's actually true. Going back to the drug example, if we insist on $p < 0.05$, then out of every 20 drug trials where the drug isn't better than placebo, on average one study will incorrectly find that the drug is better (a false positive). If we ask for $p < 0.005$, we'll be wrong in only about 1 in 200 such trials. The threshold itself is essentially arbitrary; 0.05 was probably picked because it corresponds to roughly two standard deviations, which makes it easy to calculate, since we can almost always assume an approximately normal distribution when doing these tests.
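As a sanity check on those numbers, here is a small simulation; the data are purely synthetic, and the two-sample t-test stands in for whatever analysis a real trial would use. When the null hypothesis is true in every simulated trial, the fraction of trials rejected at each threshold comes out close to the threshold itself.

```python
# Sketch: if the null hypothesis is true in every trial, then a threshold of
# 0.05 produces a false positive in roughly 1 trial in 20, and a threshold of
# 0.005 in roughly 1 in 200.  All numbers here are simulated, not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_patients = 20_000, 100

p_values = []
for _ in range(n_trials):
    drug    = rng.normal(size=n_patients)   # same distribution for both groups,
    placebo = rng.normal(size=n_patients)   # i.e. the null hypothesis is true
    p_values.append(stats.ttest_ind(drug, placebo).pvalue)

p_values = np.array(p_values)
print(f"fraction rejected at 0.05:  {np.mean(p_values < 0.05):.3f}   (expect about 0.05)")
print(f"fraction rejected at 0.005: {np.mean(p_values < 0.005):.4f}  (expect about 0.005)")
```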
Given just how many drug trials are performed, that might be a good thing to do; we don't want to approve a drug that doesn't actually do anything! All drugs have side effects, and all drugs cost money, so drugs that don't do anything are bad.
Now, if you've been paying attention, you might note that the p-value only tells you how surprising the data would be under the null hypothesis. And we don't actually care about the null hypothesis at all! We want to know whether our hypothesis is correct, not whether the null hypothesis is wrong.
This is a fundamental limitation of p-values and the philosophy of statistics that spawned them, because that philosophy does not really believe in probabilities of hypotheses, only proportions of outcomes. So it isn't about the probability of a study being wrong; it's about the proportion of studies that we're willing to accept being wrong in a certain way. Under that philosophy, asking about the probability of a hypothesis is a nonsense question: a hypothesis is simply either true or false.
The other philosophy of statistics is perfectly happy to assign probabilities to hypotheses, with the probability representing our own uncertainty about whether the hypothesis is true or false. In reality, it's one or the other, but since we don't know, probabilities are a great way to measure our uncertainty.
Unfortunately, the other philosophy of statistics often involves math that was intractable before computers, and the p-value philosophy had a very aggressive advocate, so p-values became extremely popular, even though a great many of the people using them use them incorrectly: they forget that the p-value is the probability of the data GIVEN THAT THE NULL HYPOTHESIS IS CORRECT. They leave off that last part and treat the p-value as the probability that the null hypothesis is correct.
This ignores the prior probability of the null hypothesis given everything else you know, which is exactly what you need to get from what the p-value actually is to what people think it is. And if the hypothesis you're testing is deeply unlikely to be true, which is extremely common since there are many, many possible hypotheses, then you need WAY more evidence before you assign high probability to it, and just taking the p-value at face value will mess you up rather badly.
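A small back-of-the-envelope calculation illustrates the point. The prior (10% of tested hypotheses are actually true) and the power (a real effect is detected 80% of the time) are assumptions picked purely for illustration, not figures from the quoted comment.

```python
# Sketch: how likely a "significant" result is to reflect a real effect
# depends heavily on how plausible the hypothesis was to begin with.
# The prior and power below are illustrative assumptions only.
prior_true = 0.10   # assume 10% of hypotheses tested are actually true
power      = 0.80   # assume a study detects a real effect 80% of the time

for alpha in (0.05, 0.005):
    true_positives  = prior_true * power
    false_positives = (1 - prior_true) * alpha
    prob_real_given_significant = true_positives / (true_positives + false_positives)
    print(f"alpha = {alpha}: P(effect is real | p < alpha) is about {prob_real_given_significant:.2f}")
```

With these assumptions, roughly a third of "significant" results at 0.05 are false positives, while at 0.005 almost all of them are real, which is one of the arguments for the lower threshold.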
This is all also ignoring the fact that if you really want to reject the null hypothesis, you can just run 20 studies and, on average, one of them will reject just by chance (at a threshold of 0.05). Going to 0.005 would require about 200 studies to get one to come up by chance.
And there are many, many, many ways to fiddle with your data to turn one dataset into 20 datasets, find one that rejects, and then not really talk about the other 19. Or to otherwise get a low p-value when you shouldn't. I've seen analyses where the math in a paper would reject the null hypothesis something like 60% of the time for totally random data.
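To see how quickly this bites, here is a short simulation of the "run 20 studies" scenario; all data are drawn from the same distribution, so every null hypothesis is true, yet the chance of at least one "significant" result at 0.05 is about 64%.

```python
# Sketch of the multiple-testing problem: with 20 independent tests of true
# null hypotheses, the chance that at least one rejects at 0.05 is
# 1 - 0.95**20, roughly 64%.  The simulation just confirms the arithmetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments, n_tests, n_samples = 5_000, 20, 50

at_least_one = 0
for _ in range(n_experiments):
    significant = 0
    for _ in range(n_tests):
        a = rng.normal(size=n_samples)   # both groups come from the same
        b = rng.normal(size=n_samples)   # distribution, so the null is true
        significant += stats.ttest_ind(a, b).pvalue < 0.05
    at_least_one += significant > 0

print(f"analytic chance of at least one false positive in 20 tests: {1 - 0.95**20:.3f}")
print(f"simulated: {at_least_one / n_experiments:.3f}")
```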
So, if we lower the threshold to 0.005, it will be harder to publish papers, because papers only get accepted if they say "we rejected the null hypothesis" (which is taken to mean "my hypothesis was right"); saying "we failed to reject the null" is taken as "my study didn't show anything and was a giant waste of time". Whether this would be catastrophic, I couldn't tell you; it will certainly be bad for labs publishing papers about results that aren't real.
Solution 2:
In statistics, the p-value is the probability that, under a given statistical model in which the null hypothesis is true, the statistical summary (such as the sample mean difference between two compared groups) would be the same as or more extreme than the actual observed result.
Less technically: let's say the null hypothesis is actually true. The p-value is the probability that the statistic would be the same as or more extreme than the value we calculate from the sample (e.g. the sample mean). So we can interpret the p-value as a measure of how compatible our data are with the null hypothesis. If that probability is lower than a pre-determined level, we conclude that it is unlikely that the null hypothesis is actually true.
https://en.wikipedia.org/wiki/P-value
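As a minimal illustration of that definition (the numbers below are made up), a one-sample t-test in SciPy returns exactly this kind of probability, computed under the null hypothesis that the true mean is zero.

```python
# Minimal illustration of the definition above: the p-value from a one-sample
# t-test is the probability, under the null hypothesis (true mean = 0), of a
# sample mean at least as extreme as the one observed.  The data are made up.
import numpy as np
from scipy import stats

sample = np.array([0.3, 1.1, -0.2, 0.8, 0.5, 1.4, 0.1, 0.9])

result = stats.ttest_1samp(sample, popmean=0.0)
print(f"sample mean = {sample.mean():.2f}, p-value = {result.pvalue:.4f}")
```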
Solution 3:
I lightly rewrite this already outstanding /r/eli5 comment to simplify it further. I changed "New Yorker" to "Utahn" as the latter is shorter.
Suppose you want to show that, say, Texans eat more than Utahns do. What you're really trying to prove is that Texans do not eat the same amount as, or less than, Utahns do. The statement "Texans and Utahns eat the same amount" is called your "null hypothesis". Hypothesis testing has the goal of disproving the null hypothesis in order to prove what you're trying to show.
The idea of statistical testing is to say "well, assuming that Texans and Utahns did eat the same amount, how likely would we be to get the data we did?" The chance of getting data at least as extreme as the data you got if, in fact, they did eat the same amount is called the p-value. For instance, if we say that $p = 0.05$, we mean that if Texans and Utahns ate the same amount, there'd be a 1-in-20 chance of observing results like the ones we did observe. The lower the p-value, the more surprising your data would be if the null hypothesis were true, and the more confident you can be that, in fact, Texans do eat more.
The significance level is the largest p-value you'll still accept as "strong enough" evidence. Lower significance thresholds decrease your chances of a false positive (i.e., concluding that Texans eat more when in fact they don't), but increase your chances of a false negative (concluding that you can't tell whether Texans eat more, when in fact they do). Usually 5% is the weakest significance level anyone takes seriously, but for situations where a false positive is extremely costly, you may choose a much lower number like 0.1%.
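For completeness, here is how the Texans-vs-Utahns comparison might look in code as a one-sided two-sample t-test; the calorie figures are simulated, purely to make the sketch runnable, and are not real survey data.

```python
# Sketch of the Texans-vs-Utahns example as a one-sided two-sample t-test.
# The "daily calories" figures are invented purely to make the code runnable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
texans = rng.normal(2600, 300, size=40)   # hypothetical daily calorie intakes
utahns = rng.normal(2450, 300, size=40)

# Null hypothesis: the two groups eat the same amount on average.
# alternative="greater" asks for the one-sided p-value that Texans eat more.
result = stats.ttest_ind(texans, utahns, alternative="greater")
print(f"p-value = {result.pvalue:.4f}")
print("reject at 5%:  ", result.pvalue < 0.05)
print("reject at 0.1%:", result.pvalue < 0.001)
```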