I own my data, until I don’t.


Science is in the middle of a range war, or perhaps a skirmish.

Ten years ago, I saw a mighty good western called Open Range. Based on the ads, I thought it was just another Kevin Costner vehicle. But Duncan Shepherd, the notoriously stingy movie critic, gave it three stars. I not only went, but also talked my spouse into joining me. (Though she needs to take my word for it, because she doesn’t recall the event whatsoever.)

The central conflict in Open Range is between fatcat establishment cattle ranchers and a band of noble itinerant free grazers. The free grazers roam the countryside with their cows in tow, chewing up the prairie wherever they choose to meander. In the time the movie was set, the free grazers were approaching extirpation as the western US was becoming more and more subdivided into fenced parcels. (That’s why they filmed it in Alberta.) To learn more about this, you could swing by the Barbed Wire Museum.

The ranchers didn’t take kindly to the free grazers using their land. The free grazers thought, well, free grazing has been a well-established practice, and grass out in the open should be free.

If you’ve ever passed through the middle of the United States, you’d quickly realize that the free grazers lost the range wars.

On the prairie, what constitutes community property? If you’re on loosely regulated public land administered by the Bureau of Land Management, then you can use that land as you wish, but for certain uses (such as grazing), you need to lease it from the government. You can’t feed your cow for free, nowadays. That community property argument was settled long ago.

Now to the contemporary range wars in science: What constitutes community property in the scientific endeavor?

In recent years, technological tools have evolved such that scientists can readily share raw datasets with anybody who has an internet connection. There are some who argue that all raw data used to construct a scientific paper should become community property. Some have the extreme position that as soon as a datum is collected, regardless of the circumstances, it should become public knowledge as promptly as it is recorded. At the other extreme, some others think that data are the property of the scientists who created them, and that the publication of a scientific paper doesn’t necessarily require dissemination of raw data.

As in most matters, the opinions of most scientists probably lie somewhere between the two poles.

The status quo, for the moment, is that most scientists do not openly disseminate their raw data. In my field, most new papers that I encounter are not accompanied by fully downloadable raw datasets. However, some funding agencies are requiring the availability of raw data. There are a few journals of which I am aware that require all authors to archive data upon publication, and there are many that support, but do not require, archiving.

Access to other people’s data, without any need to interact with the people who created them, is becoming more common. As the situation evolves, folks on both sides are getting upset at the rate of change – either it’s too slow, or too quick, or in the wrong direction.

Regardless of the trajectory of “open science,” the fact remains that, at the moment, we are conducting research in a culture of data ownership. With some notable exceptions, the default expectation is that when data are collected, the scientist is not necessarily obligated to make these data available to others.

Even after a paper is published, there is no broadly accepted community standard that the data that resulted in the paper become public information. On what grounds do I assert this? Well, last year I had three papers come out, all of which are in reputable journals (Biotropica, Naturwissenschaften, and Oikos, if you’re curious). In the process of publishing these papers, nobody ever even hinted that I could or should share the data that I used to write them. This is pretty good evidence that publishing data is not yet standard practice, though things are slowly moving in that direction. Case in point: I just got an email from Oikos, as a recent author, asking me to fill out a survey to let them know how I feel about data archiving policies for the journal.

As far as the world is concerned, I still own the data from those three papers published last year. If you ask me for the data, I’d be glad to share them with you after a bit of conversation, but for the moment, for most journals it seems to be my choice. I don’t think any of those three journals have a policy indicating that I need to share my dataset with the public. I imagine this could change in the near future.

I was chatting with a collaborator a couple weeks ago (working on “paper i”) and we were trying to decide where we should send the paper. We talked about PLOS ONE. I’ve sent one paper to this journal before, actually one of my best papers. Then I heard about the journal’s new policy requiring public archiving of datasets from all papers it publishes.

All of a sudden, I’m less excited about submitting to this journal. I’m not the only one to feel this way, you know.

Why am I sour on required data archiving? Well, for starters, it is more work for me. We did the field and lab work for this paper during 2007-2009. This is a side project for everybody involved and it’s taken a long time to muster the activation energy to get this paper written, even if the results are super-cool.

Is it my fault that it’ll take more work to share the data? Sure, it’s my fault. I could have put more effort into data management from the outset. But I didn’t, as it would have been more effort and would have kept me from doing as much science as I have done. It comes with temporal overhead. Much of the data were generated by an undergraduate researcher, a solid scientist with decent data management practices. But I was working with multiple undergraduates in the field in that period of time, and we were getting a lot done. I have no doubts about the validity of the science we are writing up, but I am entirely unthrilled about cleaning up the dataset and adding the details into the metadata for the uninitiated. And our data are a combination of behavioral bioassays, GC-MS results from a collaborator, all kinds of ecological field measurements, weather over a period of months, and so on. To get these numbers into a downloadable and understandable condition would be, frankly, an annoying pain in the ass. And anybody working on these questions wouldn’t want the raw data anyway, and there’s no way these particular data would be useful in anybody’s meta-analysis. It’d be a huge waste of my time.
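For the uninitiated, here is a rough sketch of the kind of merge-and-document chore that a “downloadable and understandable condition” implies. To be clear, this is a hypothetical illustration, not my actual workflow: every file name, column name, and unit below is made up, standing in for the sort of bioassay, GC-MS, and weather files described above.

```python
# A minimal sketch (hypothetical files and columns, not my real dataset) of
# the cleanup a public archive expects: merge heterogeneous sources into one
# tidy table, then document every column so a stranger can use it.
import json
import pandas as pd

# Load the heterogeneous pieces, each hypothetically keyed by colony and/or date.
bioassays = pd.read_csv("bioassays_2007_2009.csv", parse_dates=["date"])  # behavioral trials
gcms = pd.read_csv("gcms_peak_areas.csv")                                 # collaborator's chemistry
weather = pd.read_csv("weather_daily.csv", parse_dates=["date"])          # field-station logger

# Merge into one tidy table: one row per colony-date observation.
tidy = (
    bioassays
    .merge(gcms, on="colony_id", how="left")
    .merge(weather, on="date", how="left")
)

# Write the archive-ready table plus a machine-readable data dictionary.
tidy.to_csv("archived_dataset.csv", index=False)
metadata = {
    "colony_id": "unique identifier for each study colony",
    "date": "observation date (ISO 8601)",
    "response": "bioassay response score (proportion, 0-1)",
    "peak_area": "GC-MS peak area for the focal compound (arbitrary units)",
    "rainfall_mm": "daily rainfall at the field station (mm)",
}
with open("archived_dataset_metadata.json", "w") as fh:
    json.dump(metadata, fh, indent=2)
```

Even in a toy version like this, the part that takes real thought is the data dictionary – describing each column so that a stranger can use it without emailing me – and that is exactly the chore I’m grumbling about.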

Considering the time it takes me to get papers written, I think it’s cute that some people promoting data archiving have suggested a 1-year embargo after publication. (I realize that this is a standard timeframe for GenBank embargoes.) The implication is that within that one year, I should be able to use that dataset for all it’s worth before I share it with others. We may very well want to use these data to build a new project, and if we do, then it probably would be at least a year before we head back to the rainforest again to get that project done. At least with the pace of work in my lab, an embargo of less than five years would be useless to me.

Sometimes, I have more than one paper in mind when I am running a particular experiment. More often, when writing a paper, I discover the need to write a different one involving the same dataset. (Shhh. Don’t tell Jeremy Fox that I do this.) I do research at a teaching institution, and things often happen at a slower pace than at the research institutions that are home to most “open science” advocates. Believe it or not, there are some key results from a 15-year-old dataset that I am planning to write up in the next few years, whenever I have the chance to take a sabbatical. This dataset has already been featured in some other papers.

One of the standard arguments for publishing raw datasets is that the lack of full data sharing slows down the progress of science. It is true that, in the short term, more and better papers might be published if all datasets were freely downloadable. However, in the long term, would everybody be generating as much data as they are now? Speaking only for myself, if I realized that publishing a paper would require sharing all of the raw data that went into it, then I would be reluctant to collect large and high-risk datasets, because I couldn’t be sure of getting as large a payoff from that dataset once the data are accessible.

Science is hard. Doing science inside a teaching institution is even harder. I am prone to isolation from the research community because of where I work. What would be the effect of making all of my raw data available to others online, without any communication? I could either become more integrated with my peers, or more isolated from them. If I knew that making my data freely downloadable would increase interactions with others, I’d do it in a heartbeat. But when my papers get downloaded and cited, I’m usually oblivious to the fact until the citing paper comes out. I can only imagine that the same thing would happen with raw data, though the rates of download would be lower.

In the prevailing culture, sharing data, along with some other substantial contribution, is standard grounds for authorship. While most guidelines indicate that providing data to a collaborator is not supposed to be grounds for authorship, the current practice is that it is. One can argue that this isn’t fair or right, but that is what happens. Plenty of journals require specification of individual author contributions and require that all authors had a substantial role beyond data contribution. However, this does not prevent the people who provide data from becoming authors.

In the culture of data ownership, the people who want to write papers using data in the hands of other scientists need to come to an agreement to gain access to these data. That agreement usually involves authorship. Researchers who create interesting and useful data – and data that are difficult to collect – can use those data as a bargaining chip for authorship. This might not be proper or right, and this might not fit the guidelines that are published by journals, but this is actually what happens.

This system is the one that “open science” advocates want to change. There are some databases with massive amounts of ecological and genomic data that other people can use, and some people can go a long time without collecting their own data, just using the data of others. I’m fine with that. I’m also fine with not throwing my data into the mix.

My data are hard-won, and the manuscripts are harder-won. I want to be sure that I have the fullest opportunity to use my data before anybody else has the opportunity. In today’s marketplace of science, having a dataset cited in a publication isn’t much credit at all. Not in the eyes of search committees, or my Dean, or the bulk of the research community. The discussion about the publication of raw data often avoids tacit facts about authorship and the culture of data ownership.

To be able to collect data and do science, I need grant money.

To get grant money, I need to give the appearance of scientific productivity.

To show scientific productivity, I need to publish a bunch of papers.

To publish a bunch of papers, I need to leverage my expertise to build collaborations.

To leverage my expertise to build collaborations, I need to have something of quality to offer.

To have something of quality to offer, I need to control access to the data that I have collected. I don’t want that to stop after publication.

The above model of scientific productivity is part of the culture of data ownership, in which I have developed my career at a teaching institution. I’m used to working amicably and collaboratively, and the level of territoriality in my subfields is quite low. I’ve read the arguments, but I don’t see how providing my data with no strings attached would somehow build more collaborations for me, and I don’t see how it would give me any assistance in the currency that matters. I am sure that “open science” advocates are wholly convinced that putting my data online would increase, rather than constrict, opportunities for me. I am not convinced, yet, though I’m open to being convinced. I think what will convince me is seeing a change in the prevailing culture.

There is one absurdity to these concerns of mine that I’m sure critics will have fun highlighting: I doubt many people would be downloading my data en masse. But it’s not that outlandish, and people have done papers following up on my own work after communicating with me. I work at a field site where many other people work; a new paper comes out from this place every few days. I already pool data with others for collaborations. I’d like to think that people want to work with me because of what I can bring to the table other than my data, but I’m not keen on testing that working hypothesis.

Simply put, in today’s scientific rewards system, data are a currency. Advocates of sharing raw data may argue that public archiving is like an investment with this currency that will yield greater interest than a private investment. The factors that shape whether the yield is greater in a public or private investment of the currency of data are complicated. It would be overly simplistic to assert that I have nothing to lose and everything to gain by sharing my raw data without any strings attached.

While good things come to those who are generous, I also have relatively little to give, and I might not be doing myself or science a service if I go bankrupt. Anybody who has worked with me will report (I hope) that I am inclusive and giving with what I have to offer. I’ve often emailed datasets without people even asking for them, without any restrictions or provisions. I want my data to be used widely. But even more, I want to be involved when that happens.

Because I run a small operation in a teaching institution, my research program experiences a set of structural disadvantages compared to colleagues at an R1 institution. The requirement to share data levies the disadvantage disproportionately against researchers like myself, and others with little funding to rapidly capitalize on the creation of quality data.

To grow a scientific paper, many ingredients are required. As grass grows the cow, data grows a scientific paper.

In Open Range, the resource in dispute is not the grass, but the cows. The bad guy ranchers aren’t upset about losing the grass, they just don’t want these interlopers on their land. It’s a matter of control and territoriality. At the moment, the status quo is that we run our own labs, and the data growing in these labs are also our property.

When people don’t want to release their data, it’s not the data themselves they care about. They care about the papers that could result from these data. I don’t care if people have the numbers that I collect. What I care about is that these numbers are scientifically useful, and I wish to get scientific credit for that usefulness. Once the data are public, there is scant credit for that work.

It takes plenty of time and effort to generate data. In my case, lots of sweat, and occasionally some venom and blood, is required to generate data. I also spend several weeks per year away from my family, which any parent can relate to. Many of the students who work with me have also made tremendous personal investments in the work. Generating data in my lab often comes at great personal expense. Right now, if we publicly archived data that were used in the creation of a new paper, we would not get appropriate credit in a currency of value in the academic marketplace.

When a pharmaceutical company develops a new drug, the structure of the drug is published. But the company has a twenty-year patent and five years of exclusivity. It’s widely claimed – and believed – that without the potential for recouping the costs of developing medicines, pharmaceutical companies wouldn’t jump through all the regulatory hoops to get new drugs on the market. The patent provides the incentive for drug production. Some organizations might make drugs out of the goodness of their hearts, but the free market is driven by dollars. An equivalent argument could be made for scientists wishing for a very long time window to reap the rewards of producing their own data.

In the United States, most meat that people consume doesn’t come from grass on the prairie, but from corn grown in an industrial agricultural setting. Likewise, most scientific papers that get published come from corn-fed data produced by a laboratory machine designed to crank out a high output of papers. Ranchers stay in business by producing a lot of corn, and maximizing the amount of cow tissue that can be grown with that corn. Scientists stay in business by cranking out lots of data and maximizing how many papers can be generated from those data.

Doing research in a small pond, my laboratory is ill-equipped to compete with the massive corn-fed laboratories producing many head of cattle. Last year was a good year for me, and I had three papers. That’s never going to be able to compete with labs at research institutions — including the ones advocating for strings-free access to everybody’s data.

The movement towards public data archiving is essentially pushing for the deprivatization of information. It’s the conversion of a private resource into a community resource. I’m not saying this is bad, but I am pointing out this is a big change. The change is biggest for small labs, in which each datum takes a relatively greater effort to produce, and even more effort to bring to publication.

So far, what I’ve written is predicated on the notion that researchers (or their employers) actually have ownership of the data that they create. So, who actually owns data? The answer to that question isn’t simple. It depends on who collected it, who funded the collection of the data, and where the data were published.

If I collect data on my own dime, then I own these data. If my data were collected under the funding support of an agency (or a branch of an agency) that doesn’t require the public sharing of the raw data, then I still own these data. If my data are published in a journal that doesn’t require the publication of raw data, I still own these data.

It’s fully within the charge of NIH, NSF, DOE, USDA, EPA and everyone else to require the open sharing of data collected under their support. However, federal funding doesn’t necessarily entail public ownership (see this comment in Erin McKiernan’s blog for more on that). If my funding agency, or some federal regulation, requires that my raw data be available for free download, then I no longer own these data. The same is true if a journal has a similar requirement. Also, if I choose to give away my data, then I no longer own them.

So, who is in a position to tell me when I need to make my data public? My program officer, or my editor.

If you wish, you can make it your business by lobbying the editors of journals to change their practices, and you can lobby your lawmakers and federal agencies for them to require and enforce the publication of raw datasets.

I think it’s great when people choose to share data. I won’t argue with the community-level benefits, though the magnitude of these benefits to the community varies with the type of data. In my particular situation, when I weigh the scant benefit to the community relative to the greater cost (and potential losses) to my research program, the decision to stay the course is mighty sensible.

There are some well-reasoned folks who want to increase the publication of raw datasets and who understand my concerns. If you don’t think you understand my concerns, you really need to read this paper. In it, the authors make four recommendations for the scientific community at large, all of which I love:

  1. Facilitate more flexible embargoes on archived data
  2. Encourage communication between data generators and re-users
  3. Disclose data re-use ethics
  4. Encourage increased recognition of publicly archived data.

(It’s funny, in this paper they refer to the publication of raw data as “PDA” (public data archiving), but at least here in the States, that acronym means something else.)

And they’re right: those things will need to happen before I consider publishing raw data voluntarily. Those are the exact items that I brought up as my own concerns in this post. The embargo period would need to be far longer, and I’d want some reassurance that the people using my data will actually contact me about it, and that, if the data get re-used, I have a genuine opportunity for collaboration as long as my data are a big enough piece. And, of course, if I don’t collaborate, then the form of credit in the scientific community will need to be greater than what happens now, which is just getting cited.

The Open Data Institute says that “If you are publishing open data, you are usually doing so because you want people to reuse it.” And I’d love for that to happen. But I wouldn’t want it to happen without me, because in my particular niche in the research community, the chance to work with other scientists is particularly valuable. I’d prefer that my data be reused less often rather than more often, as long as that restriction gave me more chances to work directly with others.

Scientists at teaching institutions have a hard time earning respect as researchers (see this post and read the comments for more on that topic). By sharing my data, I realize that I can engender more respect. But I also open myself up to being used. When my data are important to others, my colleagues contact me. If anybody feels that contacting me isn’t necessary, then apparently my data aren’t that necessary.

Is public data archiving here to stay, or is it a passing fad? That is not entirely clear.

There is a vocal minority that has done a lot to promote the free flow of raw data, but most practicing scientists are not on board this train. I would guess that the movement will grow into an establishment practice, but science is an odd mix of the revolutionary and the conservative. Since public data archiving takes extra time and effort, and publishing already takes a lot of work, the only way it will catch on is if it is required. If a particular journal or agency wants me to share my data, then I will do so. But I’m not, yet, convinced that it is in my interest.

I hope that, in the future, I’ll be able to write a post in which I’m explaining why it’s in my interest to publish my raw data.

The day may come when I provide all of my data for free downloads, but that day is not today.

I am not picking up a gun in this range war. I’ll just keep grazing my little herd of cows in a large fragment of rainforest in Sarapiquí, Costa Rica until this war gets settled. In the meantime, if you have a project in mind involving some work I’ve done, please drop me a line. I’m always looking for engaged collaborators.

68 thoughts on “I own my data, until I don’t.”

  1. So very true – I have a 20-year data set that I am happy to share with potential collaborators, but not happy about just putting it out there without being co-authored – it took a lot of hard work in all kinds of weather to collect.

  2. A lot of food for thought there, Terry, which I need to digest as by and large I’ve been in favour of data archiving and have done some myself. But I did want to pick up on one thing you said:

    “So far, what I’ve written is predicated on the notion that researchers (or their employers) actually have ownership of the data that they create. So, who actually owns data?”

    Of the possible “owners” of the data that you then list, your university is not one of them. Is that because they don’t attempt to have a claim on your intellectual property? The trend in the UK, my own institution included, is that all intellectual property (including teaching materials and data) collected by salaried employees is property of the institution. As you might imagine, that’s not a popular policy amongst academics! And in truth it’s not a policy that has often been followed through, at least in our area. If it were an area that had money making potential, however, it would be a different story.

    • Actually, no, not my university, in my case. Oddly enough, according to our collective bargaining agreement, research is not part of our workload. Somehow, we are expected to do scholarship though it is not what we are paid to do, and we can lose our jobs if we don’t do enough of it. At some US universities, it’s possible for faculty to have rights to all of their intellectual work, including profitable patents.

    • Just to be clear, I’m not against public data archiving. I think it’s a great movement. It just doesn’t seem to be in the interest of my lab, at the moment.

  3. Terry – a nice post that, I think, articulates some of my own concerns. I wonder if it’s also part of the increasing culture of non-contact that’s prevailing – someone is much more likely to email than phone (or come see me from upstairs!). I think some (many?) would be more comfortable grabbing a dataset than engaging with the data creator, especially if the conclusions/direction were different (and especially so if this was known up-front).

    • Interesting remark, Alex. I was kind of wondering about this when it comes to post-publication “review” as well (https://dynamicecology.wordpress.com/2014/02/24/post-publication-review-signs-of-the-times/), but was hesitant to raise it even as a speculative possibility. Does the push for authors to document what they’ve done in exhaustive detail (and to make the raw data available for download by anyone) arise in part because for some reason people are increasingly reluctant to even email people they don’t know? And if so (and I emphasize I have no idea if it is so), why is that? Is it just impatience–it seems quicker to read the methods and download the ms than to have to email or call the author to ask for clarification or the raw data?

      • Thanks Jeremy – I thought of that post of yours this afternoon after I posted my comment. Not sure I know the answer, either.

      • Jeremy, I suspect the push for authors to provide data in repositories instead of on request is rooted in the literature on the subject more than any perception of culture. Numerous studies continually show response rates to such requests at around or below 20%. (Here’s a recent one in phylogenetics http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001636 ). Many of the journals implementing these policies have explained their reasoning in accompanying editorials such as this one from AmNat, http://doi.org/10.1086/650340, citing literature providing other justifications as well.

        I’m not really clear on what is so novel about the idea of “exhaustive detail”. Our classic literature like Gause or Kermack & McKendrick etc all did quite good jobs of documenting what they did in “exhaustive detail”, from derivations to data. Journal articles seem to have gotten shorter while methods and data have gotten longer. When did it become common-place to just leave out the details?

  4. Hi Terry,

    OK, you don’t want to share your personal data because others could profit from that, and that would give you a relative disadvantage that you consider unfair – fair enough. A few points though:

    1) “If you ask me for the data, I’d be glad to share them with you after a bit of conversation, but for the moment, for most journals it seems to be my choice.” – as far as I know, no journal gives you the choice to decline handing over your data if someone requests them to replicate your results. Are you seriously saying you would refuse handing over your data if someone says he wants to check the conclusions of your study?

    2) “If I collect data on my own dime, then I own these data.” – You might feel that way, but practically no legal system worldwide attaches intellectual property rights to raw data, see e.g. http://www.esa.int/About_Us/Industry/Intellectual_Property_Rights/Raw_corrected_and_treated_data . You are free to keep the data for yourself, but if you publish with them, you have to hand them over, and then there is no copyright on them. And even so, did YOU really collect those data personally, or did your students or field assistants collect it?

    3) “In Open Range, the resource in dispute is not the grass, but the cows. The bad guy ranchers aren’t upset about losing the grass, they just don’t want these interlopers on their land. It’s a matter of control and territoriality. At the moment, the status quo is that we run our own labs, and the data growing in these labs are also our property.” – this is a completely misleading analogy. In an open range system, we typically have a situation where a public good declines under open access, i.e. some people overuse it, and then we get into a tragedy-of-the-commons situation where the total welfare is lower than under closed access. Do you really think that the total scientific welfare, i.e. the progress of science overall, would be hindered by open data? I think you just object to having an unfair disadvantage compared to other people that free-ride on your work, but this is no argument against open data, but rather for better credit.

    4) Finally, just curious, do you find you profit from people that make software like R packages available to other people? Do you think a situation where you would have to email each package provider to discuss with him what you want to do with his software, and then include him as coauthor would be preferable?

    • 1. Your question is about a different matter. If you read later on, I wrote that my editor could tell me whether or not I had to share. If there was a question of appropriate analysis and replicability, and the journal required me to hand it over, I would. If you told the editor that you simply wanted to write a new paper, then I don’t think the editor would require me to hand it over.

      2. You’re right, I’m not writing about legal definitions of ownership. I’m referring to workable access to my data in practical terms on a day-to-day basis. My students have full rights to their data, as does my lab, except when they are collected by a paid technician, which is rare in my lab. As for data and authorship, I work out an understanding with students in advance. There is a post from about a year ago on this.

      3. I’m sorry you don’t like my analogy. You are right, it doesn’t fit perfectly. You’ve made the assumption that free grazing results in a decline in the public good. Maybe if all we had were free grazers, with no grazing rights, the land would be better managed? I don’t know, and that’s a continental-scale experiment that hasn’t been run. Just like the experiment with no data rights for anyone, which hasn’t been run. I honestly don’t know what’s better for science. I know, however, what is best for my lab in the current environment.

      4. The only way I profit from those who wrote R packages, at the moment, is by reading the work that others do. I suspect that you’re fully aware that the sharing of R programming has some fundamental differences with the sharing of original scientific data. The existence of R itself is a medium for community sharing. It’s like NEON or the LTER network, but for data analysis, designed as a public resource. I didn’t design my experiments for their data to be a public resource, except for the findings. And, yes, if someone wrote a non-public R package to do an analysis that otherwise couldn’t be done, and the authors wanted to collaborate with that person to get the project done, I think that’s fine. In fact, I’ve brought in coauthors for the purpose of statistical mojo, who had access to tools that I don’t have working access to. If you’re interested, let me know, I have a project or two in mind.

      • Hi Terry,

        I think if I write to an author that I would like to have the data to check his conclusions, I should get it, I don’t see the need for involving the editor. And according to 2), I think the author couldn’t do anything if I’d then used this data for other purposes. I should say that I’ve never done this and I would consider it quite rude to do so, but just to make the argument that once you published with a dataset, you should assume that you can be forced to hand it over already now in any journal. All that PLOS is doing is enforcing this, because at the moment many authors don’t hand over their data when asked, for whatever reason, maybe the PhD student had the data and left academia, maybe it has really been lost, or maybe they just don’t want to hand it out.

        About 3 – I don’t see how open data could do us any harm, and it seems most likely beneficial. And if it is, data sharing is sensible for the community and should be enforced – it’s a question of establishing a public good. When this results in an unfair distribution of credit, we should find other ways to fix that; I’m all for that. It’s not that I don’t see the work that is connected to obtaining empirical data, but we shouldn’t decrease a public good for the benefit of single individuals who have, after all, obtained these data while being paid by the public.

        About 4 – I don’t think the analogy is so bad – most field ecologists are quite happy to use the software of statistical ecologists, many times without even citing them, and certainly most times without involving them as a coauthor. R and its packages are different from NEON; NEON is funded for creating open data, while with R most people contribute voluntarily things they did in their “free time”, i.e. work hours. Everything that is said about open data could also be said about software – it costs a lot of time to write it, people might use it in a wrong way, and you might not get enough credit for it. But overall, I think everyone agrees that we have made huge progress by sharing statistical software among ecologists freely, and the situation is much preferable to one where everyone is sitting on his software.

        Happy to contribute mojo to your ideas, get in touch if you want.

        • I think the author couldn’t do anything if I’d then used this data for other purposes. I should say that I’ve never done this and I would consider it quite rude to do so, but just to make the argument that once you published with a dataset, you should assume that you can be forced to hand it over already now in any journal.

          This is precisely why public data archiving is problematic.

          PLOS ONE is doing more than enforcing a policy of authors being required to share data with those who wish to validate the results. To enforce that policy, all they would need to do is retract any paper whose authors are not willing to share the information needed to show that the results are accurate. Instead, they are providing the entire world unfettered access to all of the data behind the results in the journal. That’s a different thing than enforcing a policy of sharing data with those who need to ensure validity.

      • With respect to point 3 (both Florian’s comment and your response), I have a couple of thoughts.

        Like Florian pointed out, with digital copies of the data there is no direct loss of the resource itself (the data), so the analogy is not (as you point out) perfect. So this concern would only be reasonable if other scientists actually did the analyses you yourself would do. However, as others have pointed out, there are many cases where such data sharing has resulted in the people who generated data being scooped.

        Additionally, like you, much of the data we work with (to test our questions) takes a long time to generate. Whether it is morphology or sequence or behavioral data, it is often from populations or strains that took many years to generate, and (in particular for morphology and behaviour) a long time to measure. I have loads of additional analyses in mind with these data (in particular in this funding climate where collecting new data may get increasingly difficult). However, as I have not yet been scooped from sharing any of my data, I would much rather put the data out there for others to use. Indeed I hope they use it. Hopefully, I will not end up being scooped but I like to think my data is good quality and has a great deal of value for many questions (as I am sure you feel about the data you have collected). I get that we just have a different perspective on this, but as others have said, I do not think you have as much to fear in being scooped as you might think.

        • This should read

          “there are NOT many cases where such data sharing has resulted in the people who generated data being scooped.”

          Kind of a change in meaning! Sorry about that

  5. One good test of intent is where the OpenEverything waccaloon comes down on authorship expectations in the New World Order of their heart’s desire.

    • And the argument from the specific to the general: data sharing is good for me, and for the community, so it must be good for you, too!

  6. “data sharing is good for me, and for the community, so it must be good for you, too!”

    It’s more like: “data sharing is good for me, and for the community, and these benefits greatly outweigh the minor inconvenience and risk you face when forced to share”.

    • That’s a better phrasing, which people should use more often. And most of the people who should use that phrasing are not in a position to evaluate the inconvenience or risk for my lab.

      • OK, but in the above you more-or-less assume that as soon as you make any dataset available someone else will publish the paper you were intending to write, without offering co-authorship. I’ve never caught wind of this happening for any of the thousand or so datasets we’ve made available at Mol Ecol, so it’s presumably quite rare. So, while you’re certainly well placed to judge the inconvenience, I think you might be overestimating the risk of being scooped.

        • I think you might want to reread. There are a couple of paragraphs that totally contradict your claims about my assumptions. (I’d paste them in here, but one of the themes on this site is that knowledge from inquiry-based approaches is far more likely to result in deep learning.)

  7. I like your post, although I think data sharing is generally a good thing, even as a requirement. I would however like to install one rule: everyone who uses shared data (in particular data shared by coercion, no negative connotation intended) should read the accompanying paper. There are probably a lot of intricacies that you simply can’t ignore. In previous times these intricacies would have naturally been sorted out through communication. In the absence of this, I think the worst thing you can do to a data-collector is to unknowingly draw false conclusions from the data, which he/she took the effort to write a paper around :-)

    • “…to install one rule: everyone who uses shared data (in particular data… ”
      How about: Until # of papers published is no longer the primary currency in science career success, everyone who uses shared data must, within 2 years, contribute an equal amount of (a comparable type of) data to the pool.
      Until we have that rule, it seems that some researchers are being asked to risk their own futures for the “good of science” and to the advantage of other researchers. A sacrifice that is being called an “inconvenience”. Another potential danger to the “good of science” that is being overlooked is the diminished motivation to get these difficult, complicated data sets.

  8. Thanks for a detailed write-up on this important issue.

    Just wondered if you could clarify one thing here. Oikos says very clearly in their “Author Guidelines” linked from the journal page, http://www.oikosjournal.org/authors/author-guidelines “Oikos requires authors to deposit the data supporting the results in the paper in a publically accessible archive, such as Dryad (DataDryad.Org). Derived, summary data may also be archived. DNA sequences published in Oikos should be deposited in the EMBL/GenBank/DDJB Nucleotide Sequence Databases. An accession number for each sequence must be included in the manuscript.” To me, that qualifies as asking you to share the data from which the results were derived. Did I misinterpret something here? (I didn’t check the other two journals you mentioned).

    Many journals have similar policies in their author guidelines, including Nature and Science (the latter of which recently added “code” to the description). Editors and reviewers are not often asked to check or enforce all of these guidelines, trusting the authors to comply (just as we trust the authors have the data when they are not asked to show it).

    • Huh! I missed that. I wonder if that’s changed since we submitted, maybe a year ago? No editor or reviewer brought it up, and I usually browse instructions to authors each time I write a new manuscript.

      Of course I’d comply if they require for a new one, when/if I submit there. The landscape is evolving.

  9. I guess this is why I am more confused. As you state in your blog, you are worried about the POTENTIAL for some level of being scooped, but then you state that the actual likelihood of it is very low. So are you saying this is not one of the concerns then? I have re-read your post, but it seems that you are making two points that are somewhat at odds, no?

      • I do get your point that, despite the risks being low, they might perhaps be non-negligible. However, I think the difference in the severity of negative outcomes should come into play here. Surely equating someone using your data from a published study with dying in a plane crash in terms of severity of consequence is a bit much?

        In any case, I do appreciate your perspective, even if I disagree. Happily will discuss this more over a beer sometime, but I am sure we both want to get back to doing some science!

  10. Hi Terry,
    Thanks for an interesting post and the discussion in this thread. Yesterday I started to write a comment, but it became a blog post instead (http://zoonoticecology.wordpress.com/2014/03/04/to-share-or-not-to-share-your-data-some-thoughts-on-the-new-data-policy-for-the-plos-journals/).

    I have mixed feelings about the new data policy too. On the one hand, it could solve the problem of lost data and spark new research; on the other hand, it comes with a potential cost for the individual researcher who collected the data. It all boils down to the definition of the ‘minimal dataset’. If publishing a minor study means that I have to submit all my data on mallards and flu, some 22,000 data posts collected over 12 years, I will be hesitant to do that until I know I have done all the major analyses first.

    I hope we will learn more soon about what a minimal dataset is in practice.

    • Yes, I think this is how mandatory open access for data could actually slow the progress of scientific discovery – that is, if I know that I have to make all the data from a large study available as soon as the first paper from it is published, then it would be more sensible for me (as a teaching/research academic with only 1 day of research/week) to sit on that paper until I have written the rest of the papers I want to get out of the data set.

      Another angle to this debate, which is of concern to me, is that teaching-focused and cash-strapped universities (like where I work) may try to go down a MOOC-like path on research, deciding that designing and executing our own studies is inefficient/unproductive/too expensive and therefore pushing us to restrict ourselves to research questions that can be answered through publicly available data. If you don’t think this is likely, my university had a guest speaker at a workshop last year pushing the message that the future of research lay in data mining publicly available data sets.

    • Jonas,

      As far as I understand the PLoS guidelines (including the revisions), as well as the data archiving guidelines for journals I regularly publish in (Evolution, Genetics, JEB), they only require you to include the data necessary to replicate the work in the study you are publishing. So if you only used a subset of your data for that paper, it is only that specific subset of data (not the whole dataset from your experiment) that needs to be archived.

  11. Hi Dominique. Thanks for pointing to the survey!

    It is hard to tell from the context of the slides, but importantly, the 15% “scooping of present or future research plans” result was in response to a question about *data sharing via any mechanism*, not just public data archiving. Specifically, the question reads “Include experiences from datasets shared outside your research groups through any mechanism, including public archiving, selected distribution, or shared individually upon request (for example, in response to an email request).”

    So if anyone else in my field asked for my data, and I provided it, and then they built on it in a way that I’d thought I might want to do in the future, I’d have responded yes to the question. Crucially, it is not measuring an outcome of public data archiving requirements.

    Guessing this might also be a result that is higher than expected due to response bias… people who have had a painful scooping experience are probably way more likely to fill out an “attitudes on data sharing” survey. We’ll try to put it in context when we write up the full results.

    • Thanks very much for the clarifications, Heather. Good luck with the rest of the survey – looking forward to reading about the results.

    • Thanks for the clarification, Heather. Interestingly, I think this highlights another worry that some people may have about sharing data before publication and/or without discussion with authors: sometimes data can be misinterpreted based on what is available online. I’ve been thinking about this recently as I have archived some data on figshare from two submitted, but as yet unpublished, manuscripts so these can be available for reviewers (and also plan to do so for several more datasets). The associated metadata files should help re-users understand what I have done without having read the entire manuscript, but data generators have an intimate knowledge of how data was collected and curated that can be difficult to describe fully, in metadata or even in a manuscript.
      The good news, which I think this situation also demonstrates nicely, is that open, online discussions between re-users and generators appear to be a good way to ensure that data is available, and interpretations are correct :)

    • Thanks for sharing this! Quite relevant. Here’s one of your key recommendations:

      Coauthorship should be offered to data providers if their data are integral to the manuscript and if they can meet the other conditions of authorship, such as participation in the preparation of the manuscript and acceptance of responsibility for the conclusions reached.

      When there is reasonable assurance that this becomes standard practice, and an accepted community norm, and we all can agree on what “integral to the manuscript” means, then this would rock.

    • Just finished reading this- really enlightening, thanks! Covers a lot of concerns that have been voiced, including mine.

  12. What really disturbs me is this division of researchers into “data generators” and “data analyzers” that I keep hearing. What does that even mean? I am a scientist who thinks about scientific questions, develops hypotheses, designs experiments, generates data, analyzes data, publishes results, refines hypotheses, collects more data and/or reanalyzes data, publishes results….I really enjoy the entire process and don’t appreciate being rushed through the latter stages because someone has decided I am a mindless data collecting robot.

    Maybe those who only analyze and publish have too much time on their hands. I love how they say “we can’t all do everything” but then they don’t want to collaborate, they just want the data free and clear. I don’t mind using statistical methods that have been around a while – why can’t you wait a bit for new data to analyze? I spent years on these experiments. If someone wants to reanalyze my results because they have an idea in mind, and before I am satisfied with my own analyses and am ready to move on, maybe they should volunteer their services and offer to collaborate.

    • I have no idea what you’re responding to, but it doesn’t sound like it is a response to this post. I agree with much of what you say, though.

      • Sorry for being unclear- I was responding to the various posts and tweets in the discussion in general. I keep seeing this distinction being made as if there were two distinct types of scientists. I don’t consider myself an empiricist (vs a theorist) or a “data generator.” Seems to be a weird underpinning to the discussion at least from some perspectives. Not yours of course!

  13. Thanks for a thoughtful & insightful write up. It’s a difficult and nuanced topic.

    I’m not a scientist. I’m ‘just’ a programmer. As you observe, the issue has similarities with open source code. Much the same considerations go through my mind when I’m considering whether to release programs or libraries I’ve written.

    By personal observation in that context, it has become very clear to me that the code I’ve released as open source has prospered and blossomed way more prolifically than anything I’ve retained ownership of. If my goal was to change the world, or have an impact, my open code has achieved magnitudes more than my closed code has.

    While I’m not able to monetize that open source directly by selling it, it has boosted my reputation and my experience within my field (such as it is) in a way that I can only describe as ‘transformational’. I would not be anything like as good a programmer without it. I wouldn’t have the professional or social networks. It has been instrumental in getting me jobs, speaking at conferences, and the like.

    When I read your descriptions of the ‘pain in the ass’ that thorough data management can entail, I sympathise, but must confess I wonder whether this might in some way be comparable to peripheral housekeeping tasks in software, such as version control & continuous automated testing. These often elicit complaints from less experienced programmers, wondering why they should waste time on them when they could be being productive instead. However, experience at performing those tasks diligently shows that (a) the effort drops as you develop habits and tools to perform them routinely, and (b) the effort is repaid many fold, as whole new techniques are made possible (such as ‘bisecting’ to find the code responsible for introducing a bug without even looking at your code – a feat that seems magic when you first see it.)

    To what extent do you think that the effort and losses of releasing your data might be offset by the benefits you would obtain from being given unfettered access to everyone else’s data? I think this is a crucial measure. If the net is positive, then the situation is akin to open source code, and would benefit everyone. But if the net benefit isn’t positive, then maybe open data doesn’t work like open code.

  14. Lots of things to chew on! I suspect that the administration side of the academy is worried and unsure about how to deal with these issues too! I think it is important that we help to brainstorm ways that we think data sharing could be acknowledged and respected at the institutional level. Are there ways that you see this could be incentivised and included in tenure/promotion in a positive way at a teaching focused institution?

    • We’re always five steps behind the broader research community in how scholarship is recognized. At my campus, h-scores are still unknown, and they still don’t understand that being the last author is a senior authorship. Credit for having provided datasets? Millennia away. I’m okay with that, though. I think just having credit within the broader research community would help, and more importantly a prevailing culture in which you actually contact and ask to collaborate with the providers of data. This sometimes happens, I know, but there are too many people out there who say they don’t want to, don’t have to, and give me your data or else. Until the more zealous people who say they can, should, and will take data without consulting the data creators are set aside by the broader majority through clear editorial policies and practices, I’m not inclined to participate unless I have to. Which is happening at the “better” journals already, I realize.

      • Terry,

        I am curious whether your view on this would be somewhat different if there were a mechanism by which data re-use was evaluated as part of your productivity. In other words, some mechanism in between citation of your paper when it is re-used, and co-authorship? I recognize your point about evaluation (and how slow it is to change) on your campus, and that such an idea may not be fruitful for you in the next few years. But what if funding agencies and the broader scientific community (or at least within ecology and evolution) had some sort of explicit mechanism recognizing the “value” of data you collected based on how it has been re-used?

        Putting aside (temporarily), how your particular institution might not know how to evaluate it in the short term, would something like this work for you?

        • Yes, exactly. And anything on my campus trickles down from the broader research culture, very slowly. I think something broadly seen as more than a citation, but less (or different) than authorship, would fit the bill and address the concerns of mine here and those of others. I’m not sure how we get there, but this would facilitate sharing to be sure.

  15. I don’t think data providers should be co-authors on all papers using their data, but some data providers do have this explicitly in contracts: http://bit.ly/1hHS1gY [see the joint ownership of publication condition]. The disadvantage of a condition like that is when a potential user has a different point of view. Will you really share the data with them?

    And, to that point, releasing only the segment you used in your publication isn’t necessarily satisfactory either. Perhaps your results are true for species x in your data but not species y, or perhaps there is an interaction you didn’t consider.

    Finally, sometime in the future the use of altmetrics [http://altmetrics.org/manifesto/] will give credit to more than just published journal articles. Data, data tools, blog posts, r-code, etc. will all be part of a scientist’s portfolio.

  16. Sorry to come late to the party, but great post. Is the primary issue not the open sharing of data per se, but the transition between the current system and the utopian future? If we ecologists were all starting again from scratch, would you agree that a system that shares data completely and freely and immediately would be more productive, cross-referenced and robust?

    The only counter-argument I can think of is that if all data were immediately shared, people would collect less (since the return-on-investment would be smaller). But this could be compensated for by evolving metrics of reward, or a greater shift towards collective data-gathering.

    • Michael, it sounds like you are advocating for a world where scientists specialize into various roles instead of each seeing a project/question/experiment through from start to finish. So some people will be data collectors, and others will be churning out analyses (and papers). And you envision a reward system for the data collectors that equals churning out papers.

      If I am understanding you correctly, this sounds like more than an awkward transition. What reward system is ever going to equal all those papers? And did you ever think some people might be motivated to collect data because they have a question, and did it ever occur to you that they are looking forward to the intellectual excitement and satisfaction of analyzing and interpreting their data?

      • We find ways to reward open access code-writing, why not open access data creation?

        Although I don’t think this would require specialisation, I do think that, in general, specialisation is a good idea. The division of labour takes advantage of people’s natural advantages, and allows for more complex and focused skill development. Academic ecology’s support for a craftsman-style approach places priority on precisely what you mention: the individual pursuit of intellectual excitement. But this must be traded off against slower overall scientific progress, and that’s difficult to justify when we’re spending taxpayers’ money, or when we’re purportedly worried about biodiversity conservation challenges.
