5.2 Why it Matters?
In some aspects of the research process, being able to follow the linkages between outputs and inputs is already a common expectation. For example, when we read a paper, we expect it to cite its sources, and we expect to be able to track down those sources and investigate the strength and validity of the claims being made.
Consider the following scenario:
Scenario – Open, Reproducible Research
Abdul is a geneticist working at UBC. They recently came across an article in their area of research. Abdul contacted the authors to inquire about accessing the data and the scripts used to clean and analyse it; if the results of the study could be confirmed, there could be a huge impact on Abdul's area of practice. The authors forwarded Abdul the data but responded that they hadn't kept track of everything they had done with it: some of the clean-up and organization happened in Excel, and subsequent analyses were done in the statistical program SPSS.
Without being able to reproduce the analyses done on the data – changes were not tracked in Excel, and Abdul does not have access to SPSS, a proprietary, closed-source application – Abdul is unable to verify the findings claimed in the research article. This leaves Abdul unsure how these findings should be evaluated and interpreted.
In the example above, the research that Abdul came across was neither transparent nor reproducible (refer back to the Open Research module on Reproducibility & Replicability). Using open software would have been one step the researchers could have taken to remedy this. Such a choice would also have contributed to the longevity and reliability of both their research inputs and outputs.
Increasingly, if data is being summarized, we expect to be able to review the underlying raw data to understand how it’s been transformed.
Likewise, we should expect to be able to see and understand the software that was used to interpret that data and generate that output. Proprietary software limits our ability to do so; open software, on the other hand, helps to increase this transparency.
Additionally, we’ve all encountered files in formats not supported by our operating system or by any currently available program. Ideally, whether it be the ethics application that initiated a research project, the data collection tool employed, the data processing tool used, or the final output – poster, audio, video, traditional manuscript – we expect to be able to review the content 5, 10, or 15 years post-production. Open software, using open formats, helps to make this possible.
Reflecting back on the example above, if the data Abdul needed had been readily available in a standardized, open format (more about this in the module on Open Data), and the cleaning and processing of that data had been done with well-documented, freely available open source software, Abdul could have verified the results more easily and potentially improved their own research practices. As we'll see shortly, using scripts to handle the data would improve this transparency and reproducibility still further, making the whole process open and built on open tools.
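To make that idea concrete, here is a minimal sketch of what a documented, script-based cleaning step might look like in Python, an open source language. The file names and column names used here (participants_raw.csv, id, age, genotype) are hypothetical placeholders, not details from Abdul's scenario; the point is simply that every transformation is written down in the script and can be re-run by anyone who has the raw data.

    # clean_data.py -- a hypothetical, documented data-cleaning script.
    # Every change applied to the raw file is recorded here as code,
    # so anyone with the raw data can re-run and verify the cleaning step.

    import csv

    RAW_FILE = "participants_raw.csv"      # hypothetical raw export (open format)
    CLEAN_FILE = "participants_clean.csv"  # cleaned output, also an open format

    with open(RAW_FILE, newline="") as src, open(CLEAN_FILE, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["id", "age", "genotype"])
        writer.writeheader()

        for row in reader:
            # Drop records with a missing age rather than editing them by hand.
            if not row["age"].strip():
                continue
            # Normalize genotype labels to upper case before writing them out.
            writer.writerow({
                "id": row["id"].strip(),
                "age": row["age"].strip(),
                "genotype": row["genotype"].strip().upper(),
            })

Because the script and both CSV files are in open, plain-text formats, another researcher could inspect this cleaning step, re-run it, and build on it without needing any proprietary software.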