Code review for science: What we learned
The results from our code review pilot are in!
This past August, we launched a pilot with PLOS Computational Biology and some of our colleagues at Mozilla to explore the idea of code review for science. With the help of PLOS, we selected a series of published papers that included code, extracted snippets of between 200 and 500 lines, and put them in front of Mozilla engineers. The code samples selected were not whole software packages, but rather indicative of the type of analysis code one would find in a paper, written in Python, R, Perl, etc.
This experiment was a means to explore the following:
- What does code review from a software engineer outside of academia look like? How do they approach the task?
- To what extent is domain knowledge needed to do a successful code review? Is the code parseable by someone outside of that discipline?
- What lessons can be learned about code review, possibly to influence and enhance traditional peer review?
- Does this process surface issues around best practice in writing software and code? If so, what are those issues?
- Following the review, how useful are the comments to the author? Does this feedback help them in their work? How can we change those norms?
Following the completion of the reviews, Professor Marian Petre (Open University) interviewed the Mozilla volunteers about their experience, and we reached out to the paper authors to share the comments on their code. They were then also interviewed by Marian, to gauge their thoughts on the reviews and the process in general.
What did we learn?
The full write-up from the pilot can be read here on arXiv.
A few high-level points to tease out:
For many scientists, this was their first experience with code review.
Some of the authors were familiar with code review as a process, but many said that this was a new experience for them and, for some, their first chance to have a discussion about their code.
“The code was not written for others to use.”
While the scientists aimed to produce readable, re-usable code, the reviewers felt the software was less reusable by others than the code they were used to. A lack of comments and documentation added to this, and the reviewers identified it as a blocking point for other researchers hoping to build upon that scientist’s work.
Context and dialogue are key parts of the review process.
… And I don’t mean just the context of what’s written in the research paper itself. The reviewers felt their comments were shallower than they’d have liked, and recommended an ongoing dialogue with the author while the code is being written, to help iterate, debug, and better understand the context in which the code exists for a piece of work.
With that said, the authors still found the comments useful, particularly feedback on usability, ease of re-use, organization of README files, code structure, performance questions and optimization (“why is this so slow?”).
Both the scientists and the reviewers were frustrated by the “drive by” nature of this experiment: both wanted a longer conversation with a chance to go back and forth. This, and the fact that both sides are enthusiastic about taking part in a follow-on, are probably the most important of our findings.
For more, have a look at the full report.
The conversation didn’t stop there …
“In the business of science, all that matters is the figures. The quality of the code is just not on the critical path.” – scientist interviewed in the pilot.
Something interesting happened as we were conducting this pilot. While the project was running, we had a feature in Nature about our work exploring code review, which sparked quite a lively discussion online. (Which, in many ways, was the point of such a pilot …)
The discussion stemmed from a dissenting comment at the bottom of the Nature piece from a researcher at Johns Hopkins, one known for his work in reproducibility, saying that an experiment like this could actively “discourage” the sharing of code.
From the article:
“One worry I have is that, with reviews like this, scientists will be even more discouraged from publishing their code,” says biostatistician Roger Peng at the Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland. “We need to get more code out there, not improve how it looks.”
A number of others (Titus Brown, Nick Barnes, Carl Boettiger) and I took to Twitter to try to unpack the reasoning behind the comment. Was it a fundamental misunderstanding of what we were testing? Was it a question of methods or of review on the whole? The Nature reporter even chimed in with a snapshot of the researcher’s comments in full from her notes.
The conversation then moved to our blogs, where we further unpicked the comments and continued to explore the implications of code review for science when it comes to reproducibility, openness, and collaboration. Carl and Titus both wrote up the discussion, adding their two cents to the matter (making some very valid points, mind), and a researcher in Peng’s lab at Johns Hopkins (note: not the person behind the quote, but a close colleague) chimed in with a post further explaining the comment in that Nature piece. You can read that post by Jeff Leek here.
Jeff’s post leads with a look into their lab’s processes when it comes to detailing code in their work, speaking to how they make all of their code available openly for review and reuse. Then, midway down, he starts to explain why he thinks this would *discourage* sharing: it’s linked to an experience he had where a peer tried to discredit their work after an error in it was raised (and fixed). That sort of discrediting is unfortunately common in research, and it keeps many from working in an open, constructive and iterative fashion, where feedback is welcomed and not career-jeopardizing.
What we realized from this dialogue is that we don’t wholly disagree with Roger’s comment: if code review is done only at the end of work, it becomes another hurdle for scientists to get over in order to publish their research. But that’s not how it works in open source — in fact, few people in open source would willingly work that way.
Review is supposed to be continuous and participatory; people should have a chance to respond to feedback in order to improve both what they’re working on now, and how they work in future. And, to speak to Jeff’s experience, working in that fashion should be the norm, not something peers can use against you to discredit your work.
The next stage of our work is to explore how well this works in science, and what it takes to get scientists to adopt more constructive, iterative, collaborative practices.
Many thanks again to our Mozilla colleagues who participated in this study, the PLOS staff, Professor Marian Petre, Greg Wilson, the authors, and everyone who joined in online to discuss this issue.