How do we move beyond an arbitrary statistical threshold?

In science, we’re used to suboptimal methods — because of limited time, resources, or technology. But one of our biggest methodological shortcomings can be fixed as soon as we summon the common will. It is long overdue for us to abolish 5% as a uniform and arbitrarily selected probability threshold for hypothesis testing.

A rigid alpha (α) of 0.05 is what we generally use, and it’s what we generally teach. It’s also the standard applied by folks reviewing manuscripts for publication.

I get that not everybody is using an alpha of 0.05 all of the time, as there are pockets where people do things differently. But common practice is to use a straight-up probability threshold to accept or reject a null hypothesis. Even though we have all recognized this is not a good idea, it’s what we do.

How did we get here anyway? It appears we’ve landed at α = 0.05 because that’s what Sir Ron Fisher liked: he gave it an asterisk and put it into his statistical tables. He just thought this rate of error seemed to be okay: “If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level.”

Really, Sir Ron? That’s all you got? It sounds like Fisher pretty much drew an alpha of 0.05 out of his butt, and everybody adopted it as a convention, because, well, he was RA Fisher. We could have done a lot worse, I guess.

A p-value is not a simple thing, and simple explanations of p-values miss the boat. When the American Statistical Association has to release a statement to clarify what p-values are and what p-values are not, it’s clear these are deep waters. (Their 2016 press release is a good quick explainer, and there’s more from 538.) I don’t want to delve deeply into the arguments for why having a rigid alpha is a bad idea, because others have paved that ground quite well. P-values are generally approximate values anyway, given the assumptions built into tests, as explained by Brian McGill. It’s fundamentally absurd that we make different decisions about results that have only a hair’s-width difference in probability, explains Stephen Heard. And a fixed threshold for evidence invites p-hacking, which has become the modus operandi for many scientists.
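
To make that hair’s-width point concrete, here is a minimal sketch (in Python, with invented group means, standard deviations, and sample sizes) of two experiments whose estimated effects differ only slightly, yet land on opposite sides of α = 0.05:

```python
# Two hypothetical experiments, identical except for a tiny difference
# in the estimated mean effect. All numbers are invented for illustration.
from scipy import stats

n, sd = 20, 1.0  # 20 observations per group, within-group SD of 1

for mean_diff in (0.63, 0.66):  # estimated effects a hair's width apart
    result = stats.ttest_ind_from_stats(
        mean1=mean_diff, std1=sd, nobs1=n,
        mean2=0.0, std2=sd, nobs2=n,
    )
    print(f"effect = {mean_diff:.2f}, p = {result.pvalue:.3f}")

# The two p-values straddle 0.05, so a rigid threshold calls one result
# "significant" and the other not, despite nearly identical evidence.
```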

I just read a high-quality argument describing these issues in a paper entitled “Use, Overuse, and Misuse of Significance Tests in Evolutionary Biology and Ecology.” And you know what? THIS WAS PUBLISHED IN 1991. Nineteen fricking ninety-one. Very little seems to have changed since then. (Which, come to think of it, is the year that I took undergrad biostatistics.)

We know what the problem is. We still just aren’t fixing it.

It’s not as if we don’t know what to do instead. How about we use some friggin’ nuance? How about we let the author, the reviewer, the editor, and the readers decide what the probability values mean in the context of a given experiment? To get a little more specific, I think we’d all be better off if we ramp up the importance of effect sizes and don’t fuss over p-values as much. How do we go about this? Nakagawa and Cuthill (2007) have us all covered — this review telling us how to use effect sizes is on its way to being cited 2000 times.
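
For anyone who wants a starting point, here is a minimal sketch (not Nakagawa and Cuthill’s own code; the data are made up) of reporting a standardized effect size with an approximate confidence interval alongside the p-value:

```python
# Minimal sketch: report a standardized effect size (Cohen's d) with an
# approximate 95% CI alongside the p-value. The data are invented.
import numpy as np
from scipy import stats

control = np.array([4.1, 3.8, 5.0, 4.6, 4.3, 3.9, 4.7, 4.4])
treatment = np.array([4.9, 4.2, 5.4, 4.5, 5.1, 4.4, 5.3, 4.8])

t_res = stats.ttest_ind(treatment, control)

n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / pooled_sd

# Large-sample normal approximation for the standard error of d
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
ci_low, ci_high = d - 1.96 * se_d, d + 1.96 * se_d

print(f"Cohen's d = {d:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}], "
      f"p = {t_res.pvalue:.3f}")
```

The point isn’t the particular formula; it’s that the reader gets an estimate of how big the effect is, and how uncertain we are about it, rather than a bare verdict.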

We can’t expect people to do any better unless we teach this better. My favorite stats textbook at the moment handles p-values pretty well, I think, but it also treats p-values as the central product of statistics, which, to be fair, is how they are generally handled nowadays.

What are some reasonable steps that we can all take? We can include effect size statistics whenever we report p-values. We can abolish the phrase “statistically significant” from our vocabulary. I’m not saying everybody needs to go full Bayesian, but maybe we could stop treating model selection like stepwise regression? I don’t think we need to accept a new set of guidelines to replace an alpha of 0.05, like some folks recommend. We just need to allow ourselves to put our sophisticated training to good use and interpret a full body of statistical evidence, instead of using a choose-your-own-adventure approach to deciding if our results are meaningful or meaningless.
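
As one small illustration (the numbers below are placeholders, not real results), a results sentence can lead with the estimate and its uncertainty instead of a significance verdict:

```python
# Sketch of a results sentence that leads with the estimate and its
# uncertainty rather than a "significant / not significant" verdict.
# All numbers are placeholders.
estimate, ci_low, ci_high, p = 0.42, 0.05, 0.79, 0.03

print(
    f"The treatment increased growth by {estimate:.2f} g "
    f"(95% CI {ci_low:.2f} to {ci_high:.2f}; p = {p:.2f}), "
    "which we interpret in light of the biology rather than a threshold."
)
```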

To be clear, I’m not arguing (here) that we should be ditching the hypothesis falsification approach to answering questions. I just think we need to be smarter about it.

https://twitter.com/hormiga/status/988298618745503744

8 thoughts on “How do we move beyond an arbitrary statistical threshold?”

  1. Coming from a totally different field (social science), I would have thought that the excellent remarks you make about p-values should also be made about effect sizes. There is no absolute effect size (say, half a standard deviation of something) that is the right one in all situations. With very large numbers, a small difference in the effect of a drug may save numerous lives. From the point of view of an individual, the effect may be too small to make the side effects worth enduring. Similarly, while in general big effects are more interesting than small ones, small effects can be theoretically interesting (like very small perturbations in the orbit of a planet that don’t fit Newton’s laws). And finally there are issues of practicality. The people looking for the Higgs boson seemed to have insisted on significance levels that were totally beyond the ambition of any social scientist. They did it because of the importance of the issue and because they could. I would like such significance levels, but to aspire to them would be simply silly.

    • Ian, I wholeheartedly agree that we shouldn’t have any kind of effect size threshold to replace alpha as a probability threshold. I think reporting effect sizes, and using them as part of a nuanced interpretation, would provide a more complete picture as a supplement to p-values, and would wean us off our reliance on p as an essential criterion.

  2. I won’t rehash comments already made, but I find this part of Fisher’s quote interesting: “Personally, the writer prefers to ignore entirely all results which fail to reach this level [=0.05]”

    First of all, by this definition there is no such thing as “almost/approaching/marginally” significant. Ignore it entirely. Second, Fisher doesn’t say the converse is true, i.e. that you should entirely accept all results on the other side of that threshold, at least not in that quote. Just that it becomes worth your time to even think about it.

  3. Sorry – thought the link above would show the title: “Are significance thresholds appropriate for the study of animal behavior?” (Animal Behaviour, 1999)

  4. People are looking for ways to replace the p-value or the threshold of 0.05 because they think that the problems will then go away, but replacing one thoughtless procedure with another is not going to help. The changes that will improve the ways that science and publication work must involve a requirement for more explicit consideration of evidence and principled argument in context. More nuance, indeed.
