At this point, we have covered quite a bit: open as a principle of software development, open as a principle of human and machine interpretability, and open as a facet of reproducible workflows.
If we think back to the content covered in Open Workflows, and the discussion of reproducibility and replicability, it’s worth considering that reproducibility is really about internal validation while replicability is about external validation. Reproducibility confirms that the same data and processing methods produce the same result. Replicability contributes to the evidence base through a new study modeled on a previous one.
In the spirit of open as it relates to the digital environment and reproducibility, one of the gold standards when we’re looking at a single study is computational reproducibility. That is to say, if I pass all of my inputs (data, scripts, etc.) to someone else, can they reproduce exactly what I did on their own computer? Unfortunately, the answer is frequently no. Computers are complex environments, and no two machines are going to have exactly the same one; hardware and software differences will exist, and these can affect how a program processes data. That is, unless we use a container and employ the concept of containerization.
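To make the idea of differing environments concrete, here is a minimal Python sketch that records a few facts about the machine running a script and compares two such records. The function names and the "collaborator" values are hypothetical and chosen only for illustration; a real comparison would need to cover far more than these few fields.

```python
import platform
import sys


def environment_fingerprint():
    """Collect a few facts about the machine and interpreter running this code."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "byte_order": sys.byteorder,
    }


def differences(mine, theirs):
    """Return {fact: (my_value, their_value)} for every fact that differs."""
    keys = sorted(set(mine) | set(theirs))
    return {
        key: (mine.get(key), theirs.get(key))
        for key in keys
        if mine.get(key) != theirs.get(key)
    }


if __name__ == "__main__":
    mine = environment_fingerprint()
    # Hypothetical collaborator machine, invented for illustration only.
    theirs = dict(mine, os="Linux", python_version="3.9.18")
    for key, (a, b) in differences(mine, theirs).items():
        print(f"{key}: mine={a!r} theirs={b!r}")
```

Any entry that shows up in the comparison is a place where the same script might behave differently on the two machines.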
The full details of how containerization is deployed are beyond the scope of an introductory module on open research. But the principles that containerization addresses are critical to navigating a digital environment, because they determine whether work embedded within a piece of software can be validated by others.
A Recipe for Understanding Containers
The thing about any piece of software or any script is that it is never fully self-contained. We always rely on dependencies or pre-existing bundles of code, usually called libraries. Think of it like baking muffins.
When you write your R or Python script, you’re writing out your recipe; a set of instructions with particular steps that need to be followed. Your recipe requires certain things to be fully executed though. It requires:
- an environment in which to run; let’s call this your kitchen
- something to process your ingredients into the end product; let’s call this your oven
- something to validate all the ingredients; let’s call this your mixing bowl
- a list of ingredients; let’s call these your dependencies or libraries.
Now suppose your favourite muffin recipe depends on you having eggs, butter, white flour, and cow’s milk. Let’s imagine that when you built your working environment (your kitchen), you made sure to include a lifetime supply of all of these dependencies (your ingredients). All is good until you come home one day and realize that your partner has done some upgrades. One of these upgrades is to replace all of your white flour with rye flour and your cow’s milk with goat’s milk.
This upgrade was ostensibly made to reflect the need for a healthier lifestyle. Beyond potentially being annoyed about the lack of consultation, maybe you see the problem? Next time you try to make your muffins, your validator — your mixing bowl — will be expecting white flour and cow’s milk. Unable to find these ingredients, your mixing bowl will fail to pass all the ingredients off to your oven. No more muffins, in spite of the most well-documented script — your recipe — being in hand.
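In Python terms, the mixing bowl's check might look something like the sketch below. It uses the standard library's importlib.metadata to ask which version of each package is installed. The function name and the pinned versions are hypothetical, picked only to illustrate the idea.

```python
from importlib.metadata import PackageNotFoundError, version


def check_ingredients(expected):
    """Compare expected package versions against what is installed.

    `expected` maps a package name to the exact version the script was
    written against. Returns a dict of name -> problem description; an
    empty dict means every ingredient matched.
    """
    problems = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems[name] = f"missing (expected {wanted})"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    # Hypothetical pins, chosen only for illustration.
    recipe = {"numpy": "1.26.4", "pandas": "2.2.2"}
    for name, problem in check_ingredients(recipe).items():
        print(f"{name}: {problem}")
```

A package that was upgraded behind your back shows up here as "found X, expected Y", which is the rye flour sitting where the white flour used to be.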
Even if you never had any upgrades done, what if your friend wanted your recipe? Sure, you could give them the script, and they could source all the ingredients. But if you wanted to ensure that the recipe was a perfect match to your own, you’d gift-wrap all the ingredients with the recipe attached, ensuring success.
This gift wrapping, or bundling, is exactly what containerization software does: it ensures that the code is accompanied by the appropriate environment and dependencies so that it will continue to run in the future. This is a critical aspect of reproducibility.
Docker is a popular open-source tool for containerizing software and code. For those working in the realm of High-Performance Computing, Singularity is another popular option.
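As a rough illustration of what the gift wrapping looks like with Docker, the Python sketch below writes out the text of a small Dockerfile, the file Docker reads to build a container image. The base image tag and the file names (requirements.txt, analysis.py) are assumptions made for this example, and a real project would likely need more steps than these.

```python
def render_dockerfile(base_image, requirements_file, script):
    """Return Dockerfile text that bundles a script with its pinned dependencies."""
    lines = [
        f"FROM {base_image}",                  # the kitchen: OS plus Python
        "WORKDIR /app",
        f"COPY {requirements_file} .",         # the ingredient list
        f"RUN pip install --no-cache-dir -r {requirements_file}",
        f"COPY {script} .",                    # the recipe itself
        f'CMD ["python", "{script}"]',         # what to run when the container starts
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical image tag and file names, for illustration only.
    print(render_dockerfile("python:3.11-slim", "requirements.txt", "analysis.py"))
```

Anyone who builds an image from a file like this gets the same base system and the same pinned libraries alongside the script, rather than whatever happens to be installed on their own machine.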
To learn more about containers and the way they can improve reproducibility, review the following videos:
- A conference presentation on the basic elements and implementation of Docker: Using Docker Containers to Improve Reproducibility in PL/SE Research (42:08)
- An introduction to containers using Singularity and some of the differences between Docker and Singularity (48:23)
- An introduction to setting up a container using Singularity (13:06)