In science, we’re used to suboptimal methods — because of limited time, resources, or technology. But one of our biggest methodological shortcomings can be fixed as soon as we summon the common will. The time is overdue for us to abolish 5% as a uniform and arbitrarily selected probability threshold for hypothesis testing.
A rigid alpha (α) of 0.05 is what we generally use, and it’s what we teach. It’s also the standard applied by folks reviewing manuscripts for publication.
I get that not everybody uses an alpha of 0.05 all of the time, as there are pockets where people do things differently. But common practice is to use a straight-up probability threshold to accept or reject a null hypothesis. Even though we’ve all recognized this is a bad idea, it’s what we do.
How did we get here anyway? It appears we’ve landed at α = 0.05 because that’s what Sir Ron Fisher liked: he gave it an asterisk and put it into his statistical tables. He just thought this rate of error seemed to be okay: “If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level.”
Really, Sir Ron? That’s all you got? It sounds like Fisher pretty much drew an alpha of 0.05 out of his butt, and everybody adopted it as a convention, because, well, he was RA Fisher. We could have done a lot worse, I guess.
A p-value is not a simple thing, and simple explanations of p-values miss the boat. When the American Statistical Association has to release a statement to clarify what p-values are and what p-values are not, it’s clear these are deep waters. (Their press release from 2016 is a good quick explainer, and there’s more from 538.) I don’t want to delve deeply into the arguments for why having a rigid alpha is a bad idea, because others have paved that ground quite well. P-values are generally approximate values anyway, given the assumptions built into tests, as explained by Brian McGill. It’s fundamentally absurd that we make different decisions about results that have only a hair’s width difference in probability, explains Stephen Heard. And a fixed threshold for evidence invites p-hacking, which is the modus operandi for many scientists.
I just read a high-quality argument describing these issues, in a paper entitled “Use, Overuse, and Misuse of Significance Tests in Evolutionary Biology and Ecology.” And you know what? THIS WAS PUBLISHED IN 1991. Nineteen fricking ninety-one. Very little seems to have changed since then. (Which, come to think of it, is the year that I took undergrad biostatistics.)
We know what the problem is. We still just aren’t fixing it.
It’s not as if we don’t know what to do instead. How about we use some friggin’ nuance? What if we let the author, the reviewer, the editor, and the readers decide what the probability values mean in the context of a given experiment? To get a little more specific, I think we’d all be better off if we ramp up the importance of effect sizes and don’t fuss over p-values as much. How do we go about this? Nakagawa and Cuthill (2007) have us all covered — this review telling us how to use effect sizes is on its way to being cited 2000 times.
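To make this concrete: a standardized effect size like Cohen’s d is cheap to compute and reportable right next to any p-value. Here’s a minimal sketch in plain Python (the two groups and their means are simulated, purely hypothetical numbers for illustration):

```python
import random
import statistics

random.seed(42)

# Two hypothetical treatment groups (simulated data, illustration only)
a = [random.gauss(10.0, 2.0) for _ in range(30)]
b = [random.gauss(11.0, 2.0) for _ in range(30)]

mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances

# Pooled standard deviation across the two groups
pooled_sd = (((len(a) - 1) * var_a + (len(b) - 1) * var_b)
             / (len(a) + len(b) - 2)) ** 0.5

# Cohen's d: difference in means, in units of pooled SD
d = (mean_b - mean_a) / pooled_sd

print(f"Cohen's d = {d:.2f}")
```

The point is that d tells a reader *how big* the difference is, in standard-deviation units, which a p-value alone never does — the same p can come from a huge effect in a tiny sample or a trivial effect in a huge one.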
We can’t expect people to do any better unless we teach this better. My favorite stats textbook at the moment handles p-values pretty well, I think, but it also treats p-values as the central product of statistics — which, to be fair, is how they are generally handled nowadays.
What are some reasonable steps that we all can be taking? We can include effect size statistics whenever we report p-values. We can abolish the phrase “statistically significant” from our vocabulary. I’m not saying everybody needs to go full Bayesian, but maybe we could stop treating model selection like stepwise regression? I don’t think we need to accept a new set of guidelines to replace an alpha of 0.05, like some folks recommend. We just need to allow ourselves to put our sophisticated training to good use and interpret a full body of statistical evidence, instead of using a choose-your-own-adventure approach to deciding if our results are meaningful or meaningless.
To be clear, I’m not arguing (here) that we should be ditching the hypothesis falsification approach to answering questions. I just think we need to be smarter about it.