Reproducibility of computational workflows is automated using continuous analysis

Brett K Beaulieu-Jones¹, Casey S Greene²

Affiliations

¹ Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
² Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

PMID: 28288103
PMCID: PMC6103790
DOI: 10.1038/nbt.3780

Nat Biotechnol. 2017 Apr;35(4):342-346. doi: 10.1038/nbt.3780. Epub 2017 Mar 13.

Abstract

Replication, validation and extension of experiments are crucial for scientific progress. Computational experiments are scriptable and should be easy to reproduce. However, computational analyses are designed and run in a specific computing environment, which may be difficult or impossible to match using written instructions. We report the development of continuous analysis, a workflow that enables reproducible computational analyses. Continuous analysis combines Docker, a container technology akin to virtual machines, with continuous integration, a software development technique, to automatically rerun a computational analysis whenever updates or improvements are made to source code or data. This enables researchers to reproduce results without contacting the study authors. Continuous analysis allows reviewers, editors or readers to verify reproducibility without manually downloading and rerunning code and can provide an audit trail for analyses of data that cannot be shared.
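The trigger described in the abstract (rerun the analysis whenever source code or data change) can be illustrated with a minimal Python sketch. This is not the authors' implementation: a continuous integration service normally reacts to commits pushed to a repository, whereas this sketch assumes a local setting and detects changes by hashing file contents. The function names `fingerprint` and `needs_rerun` are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return one SHA-256 digest covering the names and contents of the given files."""
    digest = hashlib.sha256()
    # Sort so the digest does not depend on the order in which files are listed.
    for path in sorted(Path(p) for p in paths):
        digest.update(str(path).encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_rerun(paths, previous_fingerprint):
    """Return (rerun, current_fingerprint).

    rerun is True when the code or data files differ from the previous run,
    or when no previous fingerprint exists (previous_fingerprint is None).
    """
    current = fingerprint(paths)
    return current != previous_fingerprint, current
```

A caller would store the returned fingerprint after each successful run and pass it back in on the next check. Hashing contents rather than comparing modification times means that touching a file without changing it does not trigger a rerun.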

Conflict of interest statement

Competing Financial Interests

The authors have no competing financial interests to declare.

Figures

**Figure 1. Reporting of custom CDF file descriptors in published papers**
Papers that use custom chip description files (Custom CDF) frequently do not report the version used. Each manuscript is represented by a circle whose color indicates the version it used. **A.)** 51 of the 100 most recent papers citing Dai et al. do not list a version (4 additional papers were excluded from the analysis because they cite Dai et al. but do not use the Custom CDF). **B.)** 64 of the 100 most cited papers that cite Dai et al. do not list a version (15 additional papers were excluded from the analysis because they cite Dai et al. but do not use the Custom CDF).

**Figure 2. Research computing versus container-based approaches**
**A.)** The status quo requires a reader or reviewer to find and install specific versions of dependencies. These dependencies can become difficult to find and may become incompatible with newer versions of other software packages. Different versions of packages identify different numbers of significantly differentially expressed genes from the same source code and data. **B.)** Containers define a computing environment that captures dependencies. In container-based systems, the results are the same regardless of the host system.
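To make the version problem in panel A concrete, here is a small Python sketch that records which package versions an environment provides and reports where two environments differ. It is an illustration only, assuming a Python-based analysis; it is not a replacement for a container, which also captures system libraries and tools. The helper names `snapshot` and `differing_packages` are hypothetical.

```python
import platform
import sys
from importlib import metadata


def snapshot(packages):
    """Record interpreter, platform and installed versions for the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }


def differing_packages(a, b):
    """Return the sorted package names whose versions differ between two snapshots."""
    names = set(a["packages"]) | set(b["packages"])
    return sorted(n for n in names if a["packages"].get(n) != b["packages"].get(n))
```

Saving such a snapshot alongside the results gives a reader at least a record of what was installed, which is the information the papers counted in Figure 1 often omit.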

**Figure 3. Setting up continuous analysis**
Continuous analysis can be set up in three primary steps (numbered 1, 2, and 3). (1) The researcher creates a Docker container with the required software. (2) The researcher configures a continuous integration service to use this Docker image. (3) The researcher pushes code that includes a script capable of running the analyses from start to finish. The continuous integration provider runs the latest version of the code in the specified Docker environment without manual intervention. This generates a Docker container with intermediate results that allows anyone to rerun the analysis in the same environment, produces updated figures, and stores logs describing everything that occurred. Example configurations are available in the online methods and our online repository (https://github.com/greenelab/continuous_analysis). Because code is run in an independent, reproducible computing environment and produces detailed logs of what was executed, this practice reduces or eliminates the need for reviewers to rerun code to verify reproducibility.
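Step 3 calls for a script that runs the analysis from start to finish and leaves a log of what was executed. The Python sketch below shows one way such an entry point might look. It is an assumption-laden illustration, not the configuration in the authors' repository: the function name `run_pipeline` and the log format are hypothetical, and each step is given as an argument list for `subprocess.run`.

```python
import subprocess
from datetime import datetime, timezone


def run_pipeline(steps, log_path):
    """Run commands in order, appending each command and its output to a log file.

    Returns True if every step exits with status 0; stops at the first failure
    so that later steps never run on incomplete intermediate results.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        for command in steps:
            started = datetime.now(timezone.utc).isoformat()
            result = subprocess.run(command, capture_output=True, text=True)
            # Record when the step started, what was run and how it exited.
            log.write(f"[{started}] {' '.join(command)} -> exit {result.returncode}\n")
            log.write(result.stdout)
            log.write(result.stderr)
            if result.returncode != 0:
                return False
    return True
```

A continuous integration job would call this with the project's own commands and fail the build when it returns False. Stopping at the first failing step keeps the log unambiguous about where a run broke.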

**Figure 4. Reproducible workflows with continuous analysis**
Figures produced by each run are committed back to GitHub, where changes between runs can be viewed. **A, B.)** The effect of adding a gene (HumanTw2) to a phylogenetic tree-building analysis is shown. **C, D.)** The effect of adding a sample (mt8) to the PCA plot of an RNA-seq differential expression experiment is shown.
