Open source software doesn’t necessarily mean we’ll have better stats


Let me tell you a little story about data analysis and peer review that’s giving me a little pause. Last year, I analyzed a dataset from a project that my lab did with a couple of students. I got a cool result. Since then, we’ve taken the work in a new, fruitful direction, while the paper is in the queue*.

The most interesting findings were part of an analysis that I conducted in EstimateS, with some subsequent tests in JMP. If my understanding is correct, this is the kind of opaque approach that advocates of “open science” frown upon, because they can’t see all of the code behind the analysis. (EstimateS tells you what it calculates and how, but the raw code isn’t available, I don’t think? And JMP doesn’t have open code. Also, if I’m painting “open science” with strokes that are too broad, please let me know, as I do recognize that “open science” is not one thing.) In this study, the responses to our treatments were unambiguously different, in an unexpected and robust manner.

My result was exciting. I shared it with my co-author, who asked for the data files so he could verify it in R. He didn’t get the same thing. He came back to me with results that seemed just as correct as the ones I got, but not nearly as interesting, and not showing any real differences among treatments. They looked just as credible, though, and perhaps more so because they weren’t exciting.

That was a head-scratcher, because, after a quick look through my numbers, it looked like I had gotten it right and there was no reason to think that I didn’t. And quite reasonably, my collaborator thought the same thing.

So what did I do? Well, to be honest, I just put it on very low heat on the back burner. I didn’t even think about it or look at it. I had a bunch of grants, other manuscripts, and teaching to focus on. There were other projects, just as exciting, awaiting my attention. I thought I’d get to it someday.

Fast forward maybe half a year. My coauthor contacted me and said something like, “I went through that analysis again to see why our results were different, and I found the smallest error in my code. Now my results match yours.”

So, problem solved. One more year on, and this manuscript is being shifted to the front burner now that I’m on sabbatical. When we write up the results, it won’t matter whether they came from JMP or an R package, because we got exactly the same thing. I guess I’ll say I used both, to cover my bases, because reviewers have their biases. I imagine – or at least I hope – that the software I used to conduct the analyses won’t hinder me in review.

As I shift my stats to R over this next year, here’s a worry: is moving to an “open” platform actually going to result in better statistics? Isn’t it quite possible that we’ll see bigger statistical problems than we have in the past, as more people who aren’t programmers write their own code? You might want to say, “That’s why all scientists should be programmers,” but if you chat with a variety of early-career scientists, many are not professional coders and don’t have this as a high priority. Yes, we should be more literate in this respect, but if we hold expectations that don’t reflect reality, then we might be fooling ourselves.

Consider my experience. Let’s say I had relied on my collaborator, who was using tried-and-trusted-and-peer-reviewed R packages. When the results came in, I’d have written them up, our paper would be out, and that would be it. And it would have been entirely wrong. Most journals don’t ask you to upload your code or to share it during the peer review process. But let’s say we did: what are the odds that this teensy coding error would have been caught during review? It’s quite possible that even careful reviewers who know what they’re doing could miss a slight syntax mistake that produces incorrect but credible-looking results. We are not positioned to require peer reviewers to scrutinize every single line of code on GitHub associated with every manuscript in peer review.
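
(To make that concrete, here’s a purely hypothetical sketch, with made-up data and column names rather than our actual analysis, of the kind of one-character slip that runs without complaint and produces credible-looking output:)

```r
# Hypothetical illustration only: a tiny subsetting slip that silently changes the analysis.
set.seed(42)
dat <- data.frame(
  treatment = rep(c("control", "low", "high"), each = 20),
  response  = c(rnorm(20, mean = 10), rnorm(20, mean = 10), rnorm(20, mean = 14))
)

# Intended analysis: compare all three treatments.
fit_right <- aov(response ~ treatment, data = dat)

# The slip: == instead of %in% recycles the comparison vector, quietly drops
# the "high" treatment, and still runs without any warning.
sub <- dat[dat$treatment == c("control", "low"), ]
fit_wrong <- aov(response ~ treatment, data = sub)

summary(fit_right)   # shows the treatment effect
summary(fit_wrong)   # looks just as credible, shows nothing
```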

I think the most forceful argument for “open software” is that transparency is required for quality peer review and for the ability to replicate findings. If you can’t see exactly how the analysis was conducted, the argument goes, then you can’t be confident in it. In theory, then, my providing only data and results would be a problem, because I can’t share the code inside the software.

That’s a robust argument, but based on my own experience, I don’t know if that scenario gives rise to a bigger problem. Sure, when we run analyses through (say) JMP or SPSS, it’s a black box. But it’s a mighty robust black box. While some folks have identified problems with statistical formulas in a popular spreadsheet (Excel) that were fixed a long time ago, I’m not aware that statistical results from widely-used statistical software contain any errors. If you can’t trust whether SAS has an error when running a regression because you can’t see the code, then you’ve got trust issues that transcend practicality. People can fail to understand what they’re doing or not correctly interpret what they get, of course. But as far as I know, the math is correct.

So, what and whom do you trust more to get the complex math of statistical models right? The individual scientist and their peer reviewers, some or all of whom may not be professional statisticians or programmers, or the folks employed by SAS or SPSS or whomever, who sell (way overpriced) statistical software for a living?

What do I prefer? Well, that doesn’t matter, now does it? There are enough people who are happy to crap on me for not using R that I pretty much have to make the switch. But this also means that I will actually have less confidence in my results, because I will have no fricking clue whether I made a small error in my code that gives me the wrong thing. And I can’t really rely on my collaborators and peer reviewers to catch everything, either.

It’s a good thing that, while coding in R, you are essentially forced to comprehend the statistics that you’re doing. The lack of point-and-click functionality means that you aren’t just going to order up some arbitrary stuff and slap it into a manuscript without knowing what happened. This makes it a great teaching tool. What’s worse? Using menu-driven black-box software and potentially reporting the wrong thing, or making uncaught coding errors and potentially reporting the wrong thing? I don’t know. I have no idea which problem is bigger. Speaking for my own work, I’m more concerned about the latter. For me, using R won’t up my statistical game one bit. It’ll just put me in line with emerging practices and let me conduct certain kinds of analyses that aren’t readily available to me otherwise.

If you disagree with this opinion, I’d love to see some data, so we can avoid Battling Anecdote Syndrome.


 

*My lab has a bottleneck when it comes to getting these papers out. This project was done in collaboration with two rockstar undergrads from my university, and I was hoping that they could take the lead on the project. One promptly moved on to a PhD program and, in consultation with their PI, doesn’t have the opportunity to take the lead. The other is not in a position to take the lead on this paper either (it’s complicated).

17 thoughts on “Open source software doesn’t necessarily mean we’ll have better stats”

  1. I think if one is not confident about coding, it would be possible to use R in a pretty much SPSS-like way and minimize the risk of coding errors. When I use R, most of the ‘coding’ I do is to organize data or automate multiple analyses. The actual stats (the regression or ANOVA or whatever) is usually just one command. If I wanted to, I could do all my organizing and whatnot by hand in Excel, then load the Excel sheet into R and issue one command for my ANOVA or whatever; indeed this is what most of my colleagues who use R tend to do. And when I have made errors in my R code, it hasn’t been in the actual statistics, it’s been in the other stuff (pulling out the right subsets of data or whatever). Sure, I could make an error there by e.g. specifying one of the function parameters wrong, but I don’t think this is substantively any different from the risk of clicking a wrong button in SPSS. All that is to say, I think one could avoid this worry about coding by just doing almost the entire workflow in Excel and then just pulling the data over to R at the last minute for stats (how much you can avoid R probably depends on which stats you’re doing). I think you’d then be throwing out a lot of what makes R useful (the ability to write loops and yada yada) but you’d be minimizing the chance of introducing errors, which may be worthwhile, especially when the code is being written by someone not very experienced or comfortable with coding.
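
    (For what it’s worth, a minimal sketch of the workflow described above, with a hypothetical file name and column names, might look like this:)

    ```r
    # Do the organizing by hand elsewhere, then pull the finished table into R
    # and issue one command for the actual statistics.
    library(readxl)                         # for reading the Excel sheet

    dat <- read_excel("tidy_data.xlsx")     # hypothetical, already-organized file
    dat$treatment <- factor(dat$treatment)  # make sure the grouping variable is a factor

    fit <- aov(response ~ treatment, data = dat)  # the "one command" ANOVA
    summary(fit)
    ```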

  2. Which package is more error prone: a commercial product or an open-source code set? I have no idea; that’s an empirical question for which I have no data. The biggest lesson I take away from the above anecdote is the necessity of replication. In this case, pre-publication replication (eventually) increased the reliability of the findings, but all too often promising results are never replicated prior to, or after, being published.

  3. Hi Terry — this is a sticky subject, as my wife (the actual biologist in the home) prefers SPSS, while I’m a recent R convert (but … a geologist). Here are a few anecdotes from my experience.

    Excel is most maligned not just because of its poor calculation precision or inconsistency between platforms, but because it masks a user’s errors. It’s really hard to make a “pessimistic” worksheet, one that catches errors for you. In fact, a rather famous Excel error (the Reinhart-Rogoff spreadsheet mistake) helped motivate the austerity measures punishing much of Europe (where I live) today – and remember the London Whale (https://baselinescenario.com/2013/02/09/the-importance-of-excel/)?

    R, on the other hand, shows you a lot of the internals. It pushes users more toward process-based engineering — you write a useful function, vectorise it for efficiency and reuse it, maybe share it with your team, eventually the bugs become apparent and it’s a better function for it. Like guessing the weight of cows (http://www.npr.org/sections/money/2015/08/07/430372183/episode-644-how-much-does-this-cow-weigh), the more people involved, the better the result (roughly).

    One CAN do this in Excel too, but … copy & paste is often too easy. And macros. Ugh.

    A Microsoft example (from back when their stuff was pretty terrible). In 2000, the startup I worked for in Oregon had a Word file containing our life: it was a funding document, had been through management and legal, and was finally ready to send to the Cavanaugh Group or whoever the angel was that quarter, and … someone accepted changes and the document self-corrupted. Nobody could open it. Some combination of a tracked change in a bullet list, I think. A clever IT person with a covert Storm Linux installation (since we were dealing with monetary exchange, our funders had demanded licensed software ONLY on all machines) opened the Word doc in OpenOffice, turned off Track Changes, and calmly emailed it to our CEO, saving our round D financing effort and collective bacon. Quietly.

    The fact is, open-source software is not always perfect, but it’s generally had more eyes with more varied backgrounds reviewing it; that’s not a quality guarantee, but it’s significant. My example applies only to an extreme case — big file, lots of sharing, lot of participants tracking changes — that Microsoft’s QA people didn’t anticipate, but somehow an OpenOffice coder/QA team did. It’s unlikely something as mainstream as an ANOVA is gonna fail in SPSS.

    Having redundant systems arrive at the same value is generally the best solution I guess, but you’re right — would’ve required more R expertise at the first discrepancy.

    Finally — if you’re still there — cast an eye over to Jeffrey Rouder’s Born Open Data piece, generally behind a paywall but also available here:
    http://pcl.missouri.edu/sites/default/files/p_9.pdf

    Now Rouder isn’t a biologist but a behavioral psychologist I guess. I think it’s still applicable. Since you’re on sabbatical maybe we can skype about this sometime, Terry… :-)

  4. I think a point that hasn’t been considered is the cost of these programs. As a MS student, I used SAS because it was available on the computers in my advisor’s lab. As a PhD student, I use R because I don’t have access to SAS (and don’t have the money to pay for it). I think Steve Politzer-Ahles (above) is correct in that the actual stats that can be performed in R are the same as those in SAS, SPSS, etc. and the commands are basically the same as well. As such, why would I choose to purchase a stats program when I can use R for free??

  5. I think you’re mixing two different issues here. One is programming vs. point-and-click, and one is open source vs. closed source.

    The fundamental problem with point-and-click is that the analysis is not reproducible. Unless you made a video recording of yourself clicking through the options, we will never know exactly what you did and which options you chose. By contrast, if you wrote some code (and saved it!), then others can rerun it later if they want to. I’ve had so many experiences with non-reproducible stats (even from interactive use of R) that I now want everybody in my group to always save the code that they used, for every analysis.

    Open vs. closed source is more about price and availability. I will use closed-source tools when I have to, e.g. Mathematica for computer algebra calculations. I must say, though, that I frequently look at the ggplot2 source code to figure out how to make figures in R. For complex issues it can be simpler (and clearer) than reading the documentation.

  6. This gets at a problem I have – trust issues with my own coding expertise lead me to do the analyses in R, then confirm them in SPSS or SAS. Now I spend twice the time on stats because there’s pressure to use R (everyone’s using it), but I don’t want to be wrong!

  7. I’d just like to +1 Claus’ comment.

    You know I’m no stranger to the opensci advocacy circuit. BUT I also am pretty stubborn, and don’t pick up new practices until I see how they both benefit my workflow AND make my life easier.

    Reporting standards in scientific papers just don’t capture every single thing you did with a point-and-click interface, even if it’s just the most basic of summary stats, and even if they did, it would take longer to reproduce, and be less reliable, than if you had a script. You can make an error, or many errors, on either platform, but at least in a scripted analysis you have evidence of where you went wrong, and you can make tweaks and re-run. Redoing analyses in a GUI isn’t too painful when it’s one or two stats you need, but if it gets any bigger than that, it quickly becomes more hair-pulling and more prone to error (i.e., making sure you’ve specified your model exactly the same way, every time).
    You can script analyses in R, but you can also do so in SAS. The main issue with the latter is that it’s not cheap: I don’t maintain a licence anymore, so that makes it pretty hard for me to revisit any of my earlier work if someone has a question. So, um, please no one ask me about any work I did before ~2010, except maybe in a general way.

    I’m currently writing a script that looks for changes in how a population is regulating itself by dividing a time series at a given point, fitting the model, recording some summary statistics, and then moving on to the next time point. This could be done in a GUI, but just think of how many steps that would take, and how frankly awful it would be to make sure I was specifying and fitting the model the same way on each pass of the loop.
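
    (A rough sketch of the shape of that loop, with fake data rather than the commenter’s actual script, might be:)

    ```r
    # Slide a candidate breakpoint along a time series, fit the same simple model
    # on each side, and record a summary statistic for every split.
    set.seed(1)
    abundance <- cumsum(rnorm(50)) + 20   # fake population time series
    years     <- seq_along(abundance)

    results <- data.frame(split = integer(0), slope_before = numeric(0), slope_after = numeric(0))

    for (split in 5:(length(years) - 5)) {   # keep a few points on each side of the break
      before <- lm(abundance[1:split] ~ years[1:split])
      after  <- lm(abundance[(split + 1):length(years)] ~ years[(split + 1):length(years)])
      results <- rbind(results, data.frame(
        split        = split,
        slope_before = coef(before)[2],
        slope_after  = coef(after)[2]
      ))
    }

    head(results)   # the same model specification, guaranteed, at every split
    ```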

  8. The reproducibility of coded analyses in pre- or post-publication review is a desirable side effect, but I would say the main advantages of coding a data-to-results workflow come before the paper is submitted. Even when the initial coding of data preparation, analysis, and visualization takes more time than using point-and-click software, having a working piece of code saves time for any repetition of the analysis (e.g., if the data get updated, the model gets edited, etc.) and increases reliability between repetitions.

    The amount of repetition where coding becomes time-saving depends on the specific problem and the coder’s experience, but I would argue the threshold is reached for most studies that end up being published. (While this is more personal experience than hard data, I can’t think of any instance where a data-to-results workflow “only had to be done once”.)

    I agree with previous commenters that moving to coded analyses is both a bigger and more impactful step than moving from proprietary to open-source coding environments. And even if there were no errors in the underlying code, we should acknowledge that the varying quality of documentation and the inconsistent syntax/semantics between R packages can be a disadvantage to new users. The converse point is that as users gain experience with the language, they can always check the source code of a tool or function to see why it doesn’t run as intended.
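
    (For example, a couple of standard ways to peek at R source, as the commenter suggests:)

    ```r
    # Inspecting the code behind a function you don't fully trust or understand.
    sd                             # typing a plain function's name prints its definition
    stats:::t.test.default        # ::: reaches unexported methods inside a package
    getAnywhere("predict.lm")     # locates method definitions wherever they live
    ```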

  9. Just as a tip: there are lots (and I really mean LOTS) of really nice and free MOOCs out there that you can use to improve your R skills. As your skills improve, so will your confidence.

  10. As someone who has done a good deal of stats work in industry and then moved to research by getting into a graduate program, I see something of a difference in perspective. For example, “professional” software packages (Minitab, JMP, SPSS, Statistica, etc.) are expensive in no small part because they have to be validated: not just in peer review, but using test problems, code analysis, all of that stuff. From an industry standpoint, in places like medical devices, when we submit to the FDA they have an expectation that the software we use has gone through validation; in fact it is required by law as part of the CFRs governing the industry (non-product software validation). So the reality is that on the industry side we would never trust something like R, because there isn’t anything like that level of structure to the validation of all of the functions. Or at least we would have to have the validation for the functions we were using, and probably would compare the results across another package, as you did. Certainly for many functions such validation exists, but take, say, the implementation of the random forest bagging approach to bootstrapping. Does that code have the level of validation and testing to prove that it is implemented correctly for all of its options, with the untested options well documented? I’m pretty sure the answer is no – or it was no when I last used it, about 6 months ago.

    However, the trade-off is functionality. I’m not an expert with JMP by any means, but I doubt it has the aforementioned random forest bagging bootstrap functionality to tie to regression or whatever else you want to do with a small data set. R gets new functionality that, to be honest, industry for the most part doesn’t need, because the focus isn’t on research in the same way. Certainly, for any given implementation of a function, I am more likely to trust one of the expensive packages, because fundamentally that is what I am paying for. But if I’m getting into publishing space? More likely I’d run it in more than one place to see whether the results compare.

  11. What Klaus said.

    Programming errors happen, and I have no doubt that there are many errors in the literature because of these typos/bugs. But there are also errors in the literature due to typos or general screw-ups moving data from some machine output to Excel to JMP and then going through the hierarchy of windows and menu options to get a particular JMP output. This post actually highlights the reason to script: results are reproducible. Had you made an error by clicking the wrong button in JMP, or created a transformed column with no documentation of the transform, we’d have no record of it or of how you got your result.

  12. I use Excel and R in pretty much exactly the way Steve Politzer-Ahles describes: most of the staring-at-numbers and looking for patterns happens in Excel, with the actual statistics happening in R. It’s a kludge, and it always will be, to do it this way, but it means a day spent playing with my data usually results in something, whereas setting up the best option – the highly repeatable, fully documented code – is at the top of a very steep learning curve and I’m just not prepared to climb up there.

    Shifting topic slightly, the expressed concerns about one’s own coding skills vs. hidden and unknowable problems within commercial software brings to mind non-software scientific tools. I study greenhouse gas fluxes from terrestrial ecosystems, and I’ve used a variety of tools to measure concentrations of things like carbon dioxide and nitrous oxide. It’s theoretically possible that I could build my own CO2 detector from generic parts, but it’s cheaper and MUCH easier to buy a device that prominently reminds me that opening the case risks voiding my warranty.

    I tend to open the case anyway – warranties aren’t worth much when you’re way out the back of beyond in the field – and I fiddle with the software as much as my skills allow (changing some of the defaults is about as far as I go).

    Statistical analyses are central to the information contained in a scientific paper, so I can understand why a reviewer might object to running test A in software X and suggest running test B in software Y. Occasionally, a reviewer might object to measuring parameter A using device X, or insist that device X’s accuracy, precision, and other characteristics be verified by comparison with device Y. Do these discussions about R vs. JMP/SAS/SPSS have a real parallel in hardware?

    Sorry, no data to contribute and my anecdotes are crappy.

  13. Hi Terry,

    Thanks for the post. Some ideas/comments:

    For me, code is a must. Having code will not only save you every time you have to redo an analysis or figure (as always happens), but also will make it much easier to check and revise for errors, either by you, your collaborators, or peer reviewers. If there is no code, one will have to redo the full analysis by hand. In contrast, there are many tools to help write good code that works properly (e.g. see https://thepoliticalmethodologist.com/2016/06/06/embrace-your-fallibility-thoughts-on-code-integrity/). Although of course we will never be 100% sure code is bug-free.

    Furthermore, your code can be helpful to other people wanting to do similar analysis to yours. So, papers with code are not only valuable for the results they present, but also because they facilitate other scientists’ work. See e.g. Ben Marwick’s article in The Conversation for a nice overview of the reasons to prefer code (https://theconversation.com/how-computers-broke-science-and-what-we-can-do-to-fix-it-49938).

    There is certainly evidence that manual (point-and-click, copy-paste) analyses introduce many errors that could be avoided by using code. For example, read this piece in Nature about the statcheck software (http://www.nature.com/news/smart-software-spots-statistical-errors-in-psychology-papers-1.18657), or this one (https://thepoliticalmethodologist.com/2016/06/06/embrace-your-fallibility-thoughts-on-code-integrity/), which found many errors in published papers due to copy-pasting of statistical results, or failures to update results when data are revised or modified.

    As an early field ecologist who later became convinced of the importance of getting training in stats and programming, I fully sympathize with the struggle that learning represents. But in my opinion there is no shortcut. If we want to get somewhere by car, we either have to learn how to drive well or let someone else drive. If we choose to drive without proper training, we will crash sooner rather than later (of course, experienced drivers can crash too, but they are less likely to). So if we ecologists want to analyze our own data, we’ll have to get properly trained, or otherwise seek help from more data-savvy people, I think, to maximize the probability that our results are correct.

  14. @prodriguezsanchez: Excellent comment! Your last link is the same as the first but I think should be a link to something different. Is this a bug (or a feature)?

  15. Hi Terry,

    Thanks for this nice post.
    I used to naively think that stats would be done better using R, but as with every other stats package, one needs to understand what is being done (I really like prodriguezsanchez’s comment in this regard). One nice feature of R is that there is a vibrant community of users ready to help each other when someone has specific questions about whether he/she is doing the analysis right. The mixed-effects model mailing list (https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models) is one example of such a list, where users can present the analysis they want to perform, what they’ve tried, and what they don’t understand. I guess one should aim to open his/her analysis to critical scrutiny by peers before writing up a manuscript.

  16. Hi Terry,

    Great post; it made for an interesting read. There are many situations where “small” bugs cause big problems. Programming is difficult.

    However, in many instances algorithms are deterministic, so it is possible to create test sets where you know what the outcome should be. In an ideal world, the person who wrote the program would create these test sets. Then you can verify that the code is working as expected (at least for the test data).
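
    (A minimal illustration of that idea, using a made-up function and the testthat package rather than anything from the commenter’s post:)

    ```r
    # Known-answer tests: the function is deterministic, so we can write down
    # what the output should be and check it automatically after every change.
    library(testthat)

    shannon_diversity <- function(counts) {
      p <- counts / sum(counts)
      p <- p[p > 0]
      -sum(p * log(p))
    }

    test_that("Shannon diversity matches hand-calculated values", {
      expect_equal(shannon_diversity(c(10, 10)), log(2))  # two equally common species
      expect_equal(shannon_diversity(c(5, 0, 0)), 0)      # one species, zero diversity
    })
    ```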

    In software development there is a practice known as test-driven development. It can be very useful in scientific contexts like these. I have written a blog post about this, using FASTA file parsing as an example (but it could apply equally well to number-crunching problems).

    http://tjelvarolsson.com/blog/test-driven-develpment-for-scientists/

    Cheers,
    Tjelvar
