After 20 years or so in research, I think that I am finally 95% sure that I understand the significance of p.
It is often said (as I have done) in many a Methods Section that a “p-value of 0.05 was considered statistically significant”. I now think this is sloppy writing.
First, an aside about the dangers of a threshold. When p > 0.05, there are several (bad) things we might do:
- Some are really bad, i.e., p-hacking: https://pubmed.ncbi.nlm.nih.gov/25768323/ https://pubmed.ncbi.nlm.nih.gov/27510514/
- Others are sometimes useful and sometimes hilarious, i.e., playing fast and loose with our words (like the many examples at https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/ and https://twitter.com/mc_hankins)
But this all seems to be because we all use p<0.05 when we mean that α=0.05 is our threshold for significance.
The p-value is just a probability from a test: it is the probability of getting a result at least as extreme as the one observed if the null hypothesis were true (it is not the probability that the null hypothesis is true).
Remember that hypothesis testing works by trying to reject the null hypothesis; we never validate the working hypothesis.
If I hypothesize that A causes B, the null hypothesis (H0) is that A does not cause B, and the working hypothesis (H1) is that A does cause B.
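To make that concrete, here's a minimal sketch in Python (my own illustration with made-up numbers, using numpy and scipy, not data from any real study): the test asks how surprising the observed difference would be if H0 were true.

```python
# Minimal sketch (assumed example data): a two-sample t-test, where H0 is
# "group A and group B have the same mean" (i.e., A has no effect on B).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=12)  # hypothetical measurements
group_b = rng.normal(loc=11.5, scale=2.0, size=12)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# p_value is the probability of a t-statistic at least this extreme
# if H0 were true; it is not the probability that H0 is true.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```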
So if I say my threshold for statistical significance is 0.05, I've made an a priori decision that I will only reject the null hypothesis when data this extreme would turn up less than 5% of the time if the null were true.
This means I'll say A causes B only when the risk of a Type I error is capped at 5%. (A Type I error is a false positive: rejecting the null hypothesis when it is actually true.) In other words, if A really does not cause B, a test run at this threshold will wrongly tell me it does no more than 5% of the time.
(Note, I didn't say anything about Type II errors, i.e., false negatives… For that, we'll need to look at power (1−β)… stay tuned…)
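One way to see what α buys you is a quick simulation (again my own illustrative sketch with simulated data, not real results): if H0 is true in every experiment, rejecting at α = 0.05 produces a false positive in roughly 5% of them.

```python
# Illustrative sketch (simulated data, not from any real study): when H0 is
# true, rejecting at alpha = 0.05 yields a false positive about 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000

false_positives = 0
for _ in range(n_experiments):
    # Both groups drawn from the *same* distribution, so H0 is true by construction.
    a = rng.normal(loc=10.0, scale=2.0, size=12)
    b = rng.normal(loc=10.0, scale=2.0, size=12)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1  # we rejected a true null: a Type I error

print(f"Type I error rate ~= {false_positives / n_experiments:.3f}")  # close to 0.05
```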
Realizing this has made me try to put exact p-values in papers instead of ranges.
(Well, I'll still write p < 0.001, because distinguishing chances below 1 in 1,000 starts to get unhelpful. But a p = 0.01 means a 1-in-100 chance of a result this extreme under the null, a p = 0.005 means a 5-in-1,000, or 1-in-200, chance, etc. I personally think those values are good to know!)
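In practice that is just a tiny formatting rule. Here's a hypothetical helper (my own convention, not a journal requirement) that reports exact p-values with a floor at p < 0.001:

```python
# Hypothetical helper for reporting exact p-values, with a floor at p < 0.001
# as described above. Names and thresholds are my own convention.
def format_p(p: float) -> str:
    if p < 0.001:
        return "p < 0.001"
    return f"p = {p:.3f}"

print(format_p(0.0004))  # p < 0.001
print(format_p(0.01))    # p = 0.010
print(format_p(0.049))   # p = 0.049
```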
Why does this matter? Compare p = 0.051 versus p = 0.049: if the null were true, that's a 5.1% chance versus a 4.9% chance of a result this extreme.
That difference doesn't seem important to me, but many times a p = 0.051 leads to p-hacking, i.e., a researcher adds one more sample (or excludes one) to get p < 0.05. That's picking the answer you want, not doing research.
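Here's a rough simulation of exactly that behavior (simulated data, illustrative only): even when the null is true everywhere, "keep adding samples and stop as soon as p < 0.05" pushes the false-positive rate well above the nominal 5%.

```python
# Illustrative sketch (simulated data): "just add one more sample until
# p < 0.05" is a form of p-hacking. Even with every H0 true, peeking after
# each new sample and stopping at the first p < 0.05 inflates false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
n_experiments = 2_000
min_n, max_n = 10, 60  # start with 10 per group, keep adding up to 60

hacked_positives = 0
for _ in range(n_experiments):
    # H0 is true: both groups come from the same distribution.
    a = list(rng.normal(10.0, 2.0, size=min_n))
    b = list(rng.normal(10.0, 2.0, size=min_n))
    while True:
        _, p = stats.ttest_ind(a, b)
        if p < alpha:            # stop as soon as the result looks "significant"
            hacked_positives += 1
            break
        if len(a) >= max_n:      # give up at the sample-size cap
            break
        a.append(rng.normal(10.0, 2.0))
        b.append(rng.normal(10.0, 2.0))

print(f"False-positive rate with optional stopping ~= {hacked_positives / n_experiments:.2f}")
# Typically well above 0.05, even though every null hypothesis here is true.
```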
Nonetheless, we typically report p < 0.05 as statistically significant. It's a good shorthand, but don't forget what it means: it is only a threshold, and we can always make the threshold bigger (more willing to risk a Type I error) or smaller (less willing).
Instead of saying a “p-value of 0.05 was considered statistically significant”, I’ll try to write, “α was set at 0.05”. Regardless of the p-value, I can compare my p to my α.
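In code terms, the whole decision rule is just this comparison (a trivial sketch with hypothetical numbers):

```python
# A trivial sketch of the decision rule, with hypothetical numbers:
# alpha is fixed before the experiment; p comes from the test afterwards.
alpha = 0.05   # set a priori ("α was set at 0.05")
p = 0.049      # hypothetical p-value from a test

decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"alpha = {alpha}, p = {p}: {decision}")
```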
No promises that I'll do this perfectly, but I'll start trying to edit my stats sections to read that "α was set at 0.05".
tl;dr: α is a threshold that is set arbitrarily, and in advance. A p-value is the probability of a result at least that extreme if the null hypothesis is true, and it belongs to an individual test, not to the pre-set threshold. Being bigger than 0.05 (or α) doesn't make a p-value wrong.