Keeping data readable in the long run

(image by frankieleon)

This is a reminder of the obvious, but perhaps one that some of you could use: Be sure to save your files in a format, and on a medium, that you can read in the future.

Let me tell you a little story about an email that I got this week.

I heard from a colleague who was working on a project related to the first paper I published (which was a note about velvet worms that I kept finding while I was collecting ants; much of the field biology of Neotropical velvet worms remains undocumented).

They had a couple of questions about the paper, because the discussion directly contradicts the results! This is all a vague whiff of memory, as I did the fieldwork literally twenty years ago. When I took a look at this discrepancy, I detected an error in one of the tables. (It showed density as 7ish when it really was 2ish.)

I had two thoughts: First, funny that this has never come up before! Second, I was curious whether this was a typesetting error or an error in the file that I sent to the journal. I took a quick look at my hard drive to figure this out.

But I couldn’t.

I found the file that I submitted, and according to the metadata recovered from the jumbled text in the file, I wrote the manuscript in WordPerfect 3.5. I had run the analyses in Statview, and created figures in Cricket Graph III. These were all mightily useful tools that did what I needed at the time, and none of them work now. (This short and surprising 2013 conversation about Cricket Graph tidily sums up this set of challenges, and by the way, I just found an original in-box copy on sale.) I suppose if it really mattered, I could try to get an emulator up and running, and so on, but the best case scenario here would be a huge hassle, and the worst case scenario is lost data. Actually, the best way to answer this question is probably to find a printout of the manuscript (which I submitted by post) in a file cabinet in my lab. I haven’t opened that file cabinet in several years.

While the word processing, analysis, and graphic files weren’t openable, it turns out that I saved the associated data in a .txt file. I didn’t give much thought to these files, as this velvet worm paper was a one-off. In my defense, I’d like to point out that for all of my other work, I’ve created archives of the data as .csv files, which are on my hard drive, backed up every day to a separate hard drive, and also in Dropbox. Hard copies of all of these data are also shelved in my office. I don’t think I have any old data that I can’t open because it’s in the wrong format.

I started paying close attention to data curation after getting spooked by the prospect of losing material to the Year 2000 problem, otherwise known as the Y2K bug. Just in case some of my programs stopped working, I wanted to make sure I didn’t lose anything. And I’ve stayed in the habit. (Huh. I just noticed that I can’t open the word processing files for my dissertation either. No loss there, as far as I’m concerned.)

Keep in mind that files that we think might have a high longevity may not. For example, will R scripts be useful 20 or 50 years from now? Is whatever lives in Google Docs going to be there for you forever in a usable format?

Moreover, consider that pretty much every medium that we are storing our files on now won’t work in the near future. When I was doing my dissertation, I kept files on 3.5 inch floppies, and for bigger stuff, on Zip disks, if you happen to know what those are. You could get fancy and back things up by burning them to a CD if need be. I still have a ton of those floppies and Zip disks kicking around a box in my lab. Is it too late for those formats? I have a USB Zip drive that I haven’t plugged in for several years, and I am guessing it might still work? And I heard a rumor that a guy down the hall in Computer Science has a USB 3.5 inch floppy drive. But as far as I know, all of the files from those disks that I might want are also on my hard drive, so I’m not fussing over it. I suppose I should get 10-year-old movies of my kid off of DV tapes onto a hard drive before it’s too late.

I doubt that, 20 years from now, we’ll be using DVDs, USB drives, SD cards, optical drives, and whatever else we’re using today, at least not in a way that’s easily readable. Are we going to be using PDFs? (I think fastlane will remain unchanged :) )

Keep in mind that this is just me writing vaguely about this in my blog, but this is also a professional matter that falls under the expertise of librarians, who are responsible for the long-term maintenance and curation of digital information. I’m just sayin’, store your files in a text format so that you and others will have better prospects for opening them in the future, and make sure they’re archived in the medium that you’re using at the moment, and re-archive as your medium evolves. Because the long-term prospects for whatever we’re using now, “open” and otherwise, are grim.
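For what it’s worth, here is a minimal sketch of what that advice might look like in practice, using only Python’s standard library. It is one possible approach under my own assumptions, not a recommended standard: the file names (data.csv, README.txt, MANIFEST.txt), the function names, and the example row are all made up for illustration. The idea is to save the data as plain-text CSV next to a plain-text description of the columns, and to record a checksum for each file so that a copy moved to a new medium can be checked against the original.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def write_archive(folder, rows, fieldnames, notes):
    """Save rows as data.csv plus a plain-text README describing them."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    data_path = folder / "data.csv"
    # newline="" lets the csv module manage line endings itself.
    with open(data_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    (folder / "README.txt").write_text(notes, encoding="utf-8")
    return data_path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(folder):
    """Record a checksum for every file in the folder except the manifest."""
    folder = Path(folder)
    lines = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name != "MANIFEST.txt":
            lines.append(f"{sha256_of(path)}  {path.name}")
    (folder / "MANIFEST.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(folder):
    """Return names of files that are missing or no longer match the manifest."""
    folder = Path(folder)
    problems = []
    for line in (folder / "MANIFEST.txt").read_text(encoding="utf-8").splitlines():
        expected, name = line.split("  ", 1)
        path = folder / name
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(name)
    return problems


if __name__ == "__main__":
    # Made-up example row; the column names are hypothetical.
    rows = [{"plot": "Plot A, upper slope", "count": "2"}]
    with tempfile.TemporaryDirectory() as tmp:
        write_archive(tmp, rows, ["plot", "count"], "plot: plot label\ncount: animals found\n")
        write_manifest(tmp)
        print(verify_manifest(tmp))  # an empty list means every file still matches
```

After copying the folder to a new drive (or whatever replaces drives), running verify_manifest on the copy would flag any file that went missing or changed along the way. Everything here is readable in a plain text editor, which is the whole point.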

7 thoughts on “Keeping data readable in the long run”

CuriousGeorge 9 years ago

To answer your question, “yes”.

WordPerfect, Statview, and Cricket Graph are all proprietary, but because R is free and open, R scripts will still be useful (where “useful” is defined in terms of serving to make research reproducible) in 30 years, just like Fortran, C, and COBOL scripts all still run today.

The caveat being that the user needs to specify which R version was used, along with which versions of which R packages were used. And of course the data also needs to be publicly archived. (And public data archives need to be institutionally supported.)

Jeremy Fox 9 years ago

It’s interesting and sobering to think about how we can still easily read books printed hundreds of years ago, but can’t easily read documents created much more recently with Wordstar or SuperAnova (to name two programs I used to use) and stored on 5.25″ floppies or 3.5″ floppies or etc.

On the other hand, hard to say how much we’re missing out on, individually or collectively, not being able to access that old data. For instance, I bet the chemistry data from old alchemical experiments is mostly lost to the sands of time–and that nobody besides historians of alchemy has any reason to care today.

Chris Mebane 9 years ago

There’s a good animated cartoon “Data Sharing and Management Snafu in 3 Short Acts” that acts out a scenario similar to the velvet worms.

Zetta@Bean2014 9 years ago

As noted by @CuriousGeorge, R files will open as regular text files. So will the increasingly common RMarkdown files (.Rmd); it’s not the file extension (.R, .Rmd) that matters. One finer point that may be worth considering: many (most?) R users save their data in comma-separated .csv files, and I believe Gotelli and Ellison in Primer of Ecological Statistics discuss how it may be better to use tab or space delimited formats to make the data more universally accessible. There is some discussion of this on Cross Validated.
https://stats.stackexchange.com/questions/182970/why-is-a-comma-a-bad-record-separator-delimiter

Terry McGlynn 9 years ago Author

Chris, just to follow up, I just watched this video. My situation is more than a little different. Within minutes of receiving an email with a request for the original data, I sent the data to the authors. The file had clearly labeled headers and appropriate metadata.

Pingback: Small Pond Science’s Greatest Hits of 2017 | Small Pond Science

Peter Lundberg 3 years ago

I have identical problems. Heaps of data and plots in Cricket Graph. Not readable! Same with my thesis in MS Word 3 (Mac). Not readable, which is a shame, as it is now available only in microfiche. As far as I understand, however, Jim Rafferty (one of the original authors of CG) is working on some kind of CG-compatible software.

73, Peter