Pretending you planned to test that hypothesis the whole time


Our scientific papers often harbor a massive silent fiction.

Papers often lead readers to think that the main point of the paper was also the main point of the experiment when it was conducted. This is sometimes the case, but in many cases it is a falsehood.

How often is it, when we publish a paper, that we are writing up the very specific set of hypotheses and predictions that we had in mind when we set forth with the project?

Papers might state something like, “We set out to test whether X theory is supported by running this experiment…”  However, in many cases, the researchers might not even have had X theory in mind when running the experiment, but were focusing on other theories at the time. In my experience in ecology, it seems to happen all the time.

Having one question, and writing a paper about another question, is perfectly normal. This non-linearity is part of how science works. But we participate in the sham of “I always meant to conduct this experiment to test this particular question” because that’s simply the format of scientific papers.

Ideas are sold in this manner: “We have a question. We do an experiment. We get an answer.” However, that’s not the way we actually develop our questions and results.

It could be: “I ran an experiment, and I found out something entirely different and unexpected, not tied to any specific prediction of mine. Here it is.”

It somehow is unacceptable to say that you found these results that are of interest, and are sharing and explaining them. If a new finding is a groundbreaking discovery that came from nowhere (like finding a fossil where it was not expected), then you can admit that you just stumbled on it. But if it’s an interesting relationship or support for one idea over another idea, then you are required to suggest, if not overtly state, that you ran the experiment because you wanted to look at that relationship or idea in the first place. Even if it’s untrue. We don’t often lie, but we may mislead. It’s expected of us.

In some cases, the unexpected origin of a finding could be a good narrative for a paper. “I had this idea in mind, but then we found this other thing out which was entirely unrelated. And here it is!” But we never write papers that way. Maybe it’s because most editors want to trim every word that could be seen as superfluous, but more likely it’s because we need to pretend to our scientific audience that our results are directly tied to our initial questions, since that’s the way that scientists are supposed to work. It would seem less professional, or overly opportunistic, to publish interesting results from an experiment that were not the topic of the experiment.

Let me give you an example from my work. As a part of my dissertation, in the past millennium, I did a big experiment in which I and my assistants collected a few thousand ant colonies, in an experimental framework. It resulted in a mountain of cool data. This is a particularly useful and cool dataset in a few ways, because it has kinds of data that most people typically cannot get, even though they can be broadly informative. (There are various kinds of information you get from collecting whole ant colonies that you can’t get otherwise.) There are all kinds of questions that my dataset can be used to ask, that can’t be answered using other approaches.

For example, in one of the taxa in the dataset, the colonies have a variable number of queens. I wanted to test different ideas that might explain environmental factors shaping queen number. This was a fine framework to address those questions, even though it wasn’t what I had in mind while running the experiment. But when I wrote the paper, I had to participate in the silly notion that the experiment was designed to understand queen number (the pdf is free on my website and Google Scholar).

When I ran that experiment, a good while ago, the whole reason was to figure out how environmental conditions shaped the success of an invasive species in its native habitat. That was the one big thing that was deep in my mind while running the experiment. Ironically, that invasive species question has yet to be published from this dataset. The last time I tried to publish that particular paper, the editor accused me of trying to milk out a publication about an invasive species even though it was obvious (to him at least) that that wasn’t even the point of the experiment.

Meanwhile, using the data from the same experiment designed to ask about invasive species, I’ve written about not just queen number, but also species-energy theory, nest movement, resource limitation, and caste theory. I also have a few more in the queue. I’m excited about them all, and they’re all good science. You could accuse me of milking an old project, but I’m also asking questions that haven’t been answered (adequately) and using the best resources available. I’m always working on a new project with new data, but just because this project on invasive species was over many years ago doesn’t mean that I’m going to ignore additional cool discoveries that are found within the same project.

Some new questions I have are best asked by opening up the spreadsheet instead of running a new experiment. Is that so wrong? To some, it sounds wrong, so we need to hide it.

You might be familiar with the chuckles that came from the bit that went around earlier this year, involving Overly Honest Methods. There was a hashtag involved. Overly honest methods are only the tip of the proverbial iceberg about what we’re hiding in our research.

It’s time for #overlyhonesthypotheses.

28 thoughts on “Pretending you planned to test that hypothesis the whole time”

  1. It has always seemed to me that “milking” a data set is both time- and money-efficient. I don’t understand why it isn’t widely and actively encouraged.

  2. “Some new questions I have are best asked by opening up the spreadsheet instead of running a new experiment. Is that so wrong? To some, it sounds wrong, so we need to hide it.”

    Speaking as a non-scientist, but as someone who did take a fair bit of science in college, I can tell you that the first thing that I learned, on the first day of at least two (physics, chemistry) of the intro science courses I took, was the adage ‘six months in the laboratory can frequently save you an hour in the library.’

  3. Nor I. On the other hand, I’m consistently frustrated by some who do an experiment and write papers so focused on pet theories, with the full value of those projects never seeing the light of day.

  4. I generally agree. However, sometimes you set out to do a specific experiment and you pull it off successfully. Then you report it. No muss, no fuss.

  5. Technically speaking, you have lost at least some statistical support for any hypothesis that arises after the data are gathered because you knew something about the data (e.g., the patterns) before asking the question. This is like playing cards with a marked deck. To have full statistical power you’d need to use the observation (from the data) to devise a new test of the new hypothesis. You can’t claim you tested any hypothesis you didn’t have in mind BEFORE gathering the data that led to it. Any hypothesis not in mind until the data were examined needs a new test or at least should use post-hoc significance criteria (i.e., using a smaller P value to denote ‘significance’).
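To put rough numbers on the “smaller P value” idea in the comment above, here is a minimal sketch of how the chance of at least one false positive grows with the number of tests, and of one common way to shrink the threshold. It assumes the tests are independent and that every null hypothesis is true. The Bonferroni correction is one standard choice, not something the commenter names, and the function names are made up for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


# Illustrative test counts only; 190 is the number of pairs among 20 variables.
for n_tests in (1, 5, 20, 190):
    fwer = familywise_error_rate(0.05, n_tests)
    per_test = bonferroni_alpha(0.05, n_tests)
    print(f"{n_tests:>3} tests: family-wise error {fwer:.3f}, Bonferroni alpha {per_test:.5f}")
```

With 20 tests at a nominal alpha of 0.05, the chance of at least one spurious “significant” result is already about 0.64 under these assumptions.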

  6. Depends what you mean by “milking” a dataset. Letting the data tell you what hypothesis to test (e.g., by doing a bunch of different “exploratory” analyses), and then testing that hypothesis on the same data, is not efficient. It’s circular reasoning.

    The point isn’t that you shouldn’t do exploratory analyses. But you need to build a firewall between exploratory and hypothesis-testing analyses.

  7. So, let’s say that you do exploratory analyses and make a find that is worthy of a separate paper (and distinct enough that it won’t fit whatever else you’re doing). How do you write this manuscript if you can’t say that you’re asking a specific question or hypothesis?

  8. I’m unclear exactly what you mean by “do exploratory analyses and make a find”. What’s at issue here is what it takes to say, reliably, that you have in fact “made a find”. As opposed to having fooled yourself into thinking you’ve made a find, by confounding exploratory and hypothesis-testing analyses.

    Perhaps this will help: the issue isn’t the purpose for which the data were originally collected. If you have some hypothesis that can be tested using an existing dataset that was originally collected for a different purpose, that’s fine. What’s seriously problematic is not deciding what hypotheses to test until after you’ve seen the data and fiddled around with it.

  9. So, if you unexpectedly find a very interesting and presumably nonspurious relationship while fiddling, then you’re supposed to not publish it? Run a whole new experiment? Invent a lower alpha based on the amount of time spent fiddling?

  10. Terry, in saying that the relationship you found is “presumably nonspurious”, you’re presuming precisely what I’m saying you shouldn’t presume.

    Try this exercise. It’s a simple, standard illustration of the problem I’m talking about. In R, generate, say, twenty vectors of independent random numbers. Drawn from whatever distribution or distributions you like, it doesn’t matter. Now run a whole bunch of statistical tests on those twenty variables. Say, test all their pairwise correlations at the alpha=0.05 level. You’re very likely to find some “significant” correlations, which by construction are completely spurious.
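The exercise above can be sketched as follows. The commenter suggests R; this version uses Python with only the standard library, which is a substitution rather than the commenter's own code. The p-values come from the Fisher z-transform normal approximation, not an exact t-test, and the sample size of 50, the seed, and all function names are assumptions chosen for illustration.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z-transform (an approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh inside its domain
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_spurious(n_vars=20, n_obs=50, alpha=0.05, seed=1):
    """Generate independent random variables and count 'significant' pairs."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    hits = sum(
        1 for i, j in pairs
        if approx_p_value(pearson_r(data[i], data[j]), n_obs) < alpha
    )
    return hits, len(pairs)


hits, total = count_spurious()
print(f"{hits} of {total} pairwise correlations are 'significant' at alpha = 0.05")
```

Twenty variables give 190 pairs, so about 5% of them (roughly nine or ten) should pass the threshold on average even though every variable is pure noise. Changing the seed changes which pairs look “significant”, which is the point of the exercise.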

    So yeah, if that sort of thing is what you mean by “exploratory” analysis, then yes–running a new experiment in order to test the hypothesis that your exploratory analyses just generated is *precisely* what you ought to do. Or, one alternative is to randomly divide your data in half, only do exploratory analyses on half the data, and then test the hypotheses thereby generated on the other half of the data.

    I’m worried that we may be talking past each other, perhaps I’m not understanding what you mean by “exploratory” analysis. I’m worried because, as noted by other commenters, the point I’m making is a standard one made in most every undergraduate biostats course. So I can hardly believe this point isn’t familiar to you. Apologies if I’m totally misunderstanding what you’re getting at here.
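The split-half alternative mentioned in this comment can be sketched in a few lines. This is only an illustration of the idea, assuming each row of the dataset is an independent sampling unit. The colony IDs, the seed, and the function name are hypothetical and do not come from the post's dataset.

```python
import random


def split_half(rows, seed=0):
    """Randomly partition rows into an exploratory half and a confirmatory half."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical example: 100 colony IDs standing in for rows of a real dataset.
colony_ids = list(range(1, 101))
explore, confirm = split_half(colony_ids, seed=42)

# Fiddle freely with `explore`; test each resulting hypothesis once on `confirm`.
print(len(explore), "rows for exploration,", len(confirm), "rows held out for testing")
```

Fixing the seed and doing the split before looking at any of the data keeps the held-out half untouched by the exploration. The cost is that each half has less statistical power than the full dataset.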

  11. To put the point another way: the result of any frequentist statistical test, however summarized (as a p-value, a 95% c.i., whatever), is a statement about what would be expected to happen if a specified procedure were repeated many times and a specified hypothesis about the world were true. The test is only valid if the procedure specified by the test is the one you actually used. The procedure “explore the dataset by conducting a whole bunch of statistical tests, and then only keep and report whichever ones happen to come out ‘significant’ at a nominal alpha of 0.05” is not valid, because that’s not the procedure under which your statistical tests were derived. If you insist on sticking with that procedure, then you somehow need to figure out the error rates *of that procedure* (which of course you couldn’t do in practice because the procedure isn’t fully specified…)

  12. You’re right – it is a point often made.

    But, you know, in fact, that’s not how people do things. Are people committing massive statistical errors or is this not a big deal? As an author, reviewer, and editor, I’ve never seen anybody called out for this. Even when it’s obvious that I was testing new questions with old data. Nobody wanted me to lower an alpha because I found a pattern in old data for an experiment designed for something else.

    I wasn’t joking when I asked what I should do with the presumably nonspurious result. Okay, don’t make the presumption. Then what?

    It looks like most folks just go ahead and use the same alpha they would have anyway.

    This is a decades-old notion about multiple comparisons, and your post on that addresses it really well. Not much more to say about that, I think.

    This mess about creating false narratives for our papers is based, in part, on the overemphasis of hypothesis falsification and putting too much stock in probabilities rather than parsimony and reason. Too many let p-values do the thinking on their behalf, when the real understanding comes from the constraints and assumptions built into that p.

    But, I’m not a statistician, so that’s territory for others.

  13. Hi Terry,

    Multiple comparisons is a special case of the broader issue.

    And yes, it is indeed a really serious issue. You absolutely can make a really strong argument that standard operating procedure in much of science (which is indeed much as you describe it) actually invalidates a really big chunk of the scientific literature.

    See these old posts for some discussion:

    See also statistician Andrew Gelman’s blog. He’s been talking a lot about this issue recently, and what can be done about it. He calls it “the replicability crisis”.

  14. I started to write the same comment as Jeremy then decided to let it be. But I will add that even adjusting alpha for the exploration will likely lowball the number of ways you turned the data inside-out before “discovering” the “effect”. Along the route, perhaps you adjusted for size three different ways, found correlations on log-transformed and untransformed data, analyzed residuals adjusted for some effects and not others, included different subsets of cases because one case is just “different” but justified by some post-hoc biological knowledge of why they should be different, etc. etc. You’ve now returned many, many p-values. But unless you have archived the entire script, you will most likely forget many of these dead-end exploratory paths.

  15. In my areas (various biological topics) this is widely appreciated and I routinely criticize manuscripts in which the hypothesis and test(s) were not related correctly.

  16. Yes, this is a long discussion, and a worthwhile one that has happened often.

    For those who have this as a major concern, what is the prescription?
    A) Don’t fiddle at all
    B) Fiddle but never publish interesting results that come this way
    C) Fiddle but keep track of the number of tests to adjust alpha
    D) Only do new experiments to test fiddled-out results
    E) Other

    I don’t think anybody really does A, B, C, or D in practice. Am I wrong?

    If I use hundreds of thousands of dollars of taxpayer money and thousands of person-hours on a project, you can bet I’ll be looking to see what additional information can be reasonably surmised from the work. I guess this discussion is about what constitutes “reasonable.”

  17. Some combination of A, D, and E. My comments above, and the old posts I linked to, have some suggestions on what “E” might consist of.

  18. Thanks. (By the way, from the Simmons et al. article, do you always follow the four suggested guidelines for reviewers? If so, it must be mighty hard as a reviewer to recommend ‘accept’ for nearly any article at all, even excellent ones, because which manuscript really provides enough information to show that the authors satisfied all six of the guidelines for authors?)

  19. In regards to the debate above, could we just distinguish “confirmatory” and “exploratory” research (à la “On Confirmatory versus Exploratory Research” by Jaeger and Halliday, Herpetologica, 1998)? Call each what they are. We can rigorously test hypotheses with reliable statistical methods AND report interesting relationships that are worthy of further investigation. Perhaps the latter should be done without p-values and using effect sizes (Pearson’s r, etc.) instead to identify the most promising avenues of research. Maybe we need to sometimes demand LESS of our research and not require hypothesis tests, but only as long as it is clearly identified as exploratory research. Could/should such “studies” be publishable? (Well, they are already being published. Could they be published done correctly, i.e. without wrong p values from data snooping?)

  20. According to Wikipedia, Kepler tried 43 different models to describe the motion of planets round the sun before finally hitting on ellipses. He used, as far as I know, the same data set throughout. Was this bad science and should it have been published?

  21. Fitting models to data is not hypothesis-testing; it is hypothesis-creation. So if you define science as testing hypotheses, it was not “science”. I believe those hypotheses have been tested, probably repeatedly, incorrect ones discarded, and one supported, since. Personally, I’m not sure that some of the measurements astronomers make are really capable of testing hypotheses, and I’ll bet some are not replicated. No replication means it’s not believable, IMHO.

  22. The p values aren’t wrong. The alpha, though, is what some folks think should shrink if a result is found using exploratory analyses.

    Half of this problem would resolve itself once people get over using a firm significance threshold, when we all should be using nuance and our understanding of statistics to provide a more valid approach to interpreting results, based on p, analytic approach, effect sizes and also the power of the test.

  23. This is exactly what happened with the images of the graphene molecules; that wasn’t the focus of the research.
