Nat Biotechnol. 2017 Apr;35(4):342-346.
doi: 10.1038/nbt.3780. Epub 2017 Mar 13.

Reproducibility of computational workflows is automated using continuous analysis

Brett K Beaulieu-Jones et al. Nat Biotechnol. 2017 Apr.

Abstract

Replication, validation and extension of experiments are crucial for scientific progress. Computational experiments are scriptable and should be easy to reproduce. However, computational analyses are designed and run in a specific computing environment, which may be difficult or impossible to match using written instructions. We report the development of continuous analysis, a workflow that enables reproducible computational analyses. Continuous analysis combines Docker, a container technology akin to virtual machines, with continuous integration, a software development technique, to automatically rerun a computational analysis whenever updates or improvements are made to source code or data. This enables researchers to reproduce results without contacting the study authors. Continuous analysis allows reviewers, editors or readers to verify reproducibility without manually downloading and rerunning code and can provide an audit trail for analyses of data that cannot be shared.

Conflict of interest statement

Competing Financial Interests

The authors have no competing financial interests to declare.

Figures

Figure 1. Reporting of custom CDF file descriptors in published papers
Works citing custom chip description files (Custom CDF) frequently do not cite the version. Each manuscript is represented by a circle whose color indicates the version used in that paper. A.) 51 of the 100 most recent papers citing Dai et al. do not list a version (4 additional papers were excluded from analysis because they cited Dai et al. but do not use the Custom CDF). B.) 64 of the 100 most cited papers that cite Dai et al. do not list a version (15 additional papers were excluded from analysis because they cited Dai et al. but do not use the Custom CDF).
Figure 2. Research computing versus container-based approaches
A.) The status quo requires a reader or reviewer to find and install specific versions of dependencies. These dependencies can become difficult to find and may become incompatible with newer versions of other software packages. Different versions of packages identify different numbers of significantly differentially expressed genes from the same source code and data. B.) Containers define a computing environment that captures dependencies. In container-based systems, the results are the same regardless of the host system.
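The version problem described in panel A can be made visible by recording the environment alongside the results. The following is a minimal sketch, not the authors' method: the function name `snapshot_environment` and the package names passed to it are illustrative assumptions. A snapshot like this only documents Python-level packages; unlike a container, it does not capture system libraries or restore the environment.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Describe the interpreter, platform, and installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # The package names here are placeholders; list the ones an analysis imports.
    snapshot = snapshot_environment(["pip", "setuptools"])
    print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Storing such a snapshot next to each result at least tells a later reader which versions produced it, even when the environment itself cannot be rebuilt.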
Figure 3. Setting up continuous analysis
Continuous analysis can be set up in three primary steps (numbered 1, 2, and 3). (1) The researcher creates a Docker container with the required software. (2) The researcher configures a continuous integration service to use this Docker image. (3) The researcher pushes code that includes a script capable of running the analyses from start to finish. The continuous integration provider runs the latest version of code in the specified Docker environment without manual intervention. This generates a Docker container with intermediate results that allows anyone to rerun analysis in the same environment, produces updated figures, and stores logs describing everything that occurred. Example configurations are available in the online methods and our online repository (https://github.com/greenelab/continuous_analysis). Because code is run in an independent, reproducible computing environment and produces detailed logs of what was executed, this practice reduces or eliminates the need for reviewers to re-run code to verify reproducibility.
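Step 3 depends on a single script that runs the whole analysis and leaves a record of what happened. The sketch below illustrates that idea under stated assumptions: `run_analysis` and the toy step `write_summary` are hypothetical names, not part of the authors' repository, and the Docker and continuous integration configuration is not shown. The script runs each step in order, keeps a log, and hashes the outputs so that two runs can be compared.

```python
import hashlib
import tempfile
from pathlib import Path


def write_summary(output_dir):
    """Hypothetical analysis step: write a small, deterministic result file."""
    values = [2.0, 4.0, 6.0]
    mean = sum(values) / len(values)
    (output_dir / "summary.txt").write_text(f"mean={mean:.3f}\n")


def run_analysis(steps, output_dir):
    """Run every (name, step) pair in order; return a log and SHA-256 digests of the outputs."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    log = []
    for name, step in steps:
        log.append(f"running {name}")
        step(output_dir)
        log.append(f"finished {name}")
    # Hash every output file so identical results yield identical digests.
    digests = {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(output_dir.iterdir())
        if path.is_file()
    }
    return log, digests


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log, digests = run_analysis([("summary", write_summary)], tmp)
        print("\n".join(log))
        print(digests)
```

A continuous integration service would invoke a script of this kind inside the configured Docker image on every push; matching digests across runs indicate that the outputs did not change.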
Figure 4. Reproducible workflows with continuous analysis
Resulting figures from the run are committed back to GitHub, where changes between runs can be viewed. A, B.) The effect of adding an additional gene (HumanTw2) to a phylogenetic tree-building analysis is shown. C, D.) The effect of adding an additional sample (mt8) to an RNA-seq differential expression experiment PCA plot is shown.
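Viewing changes between runs amounts to asking which output files differ. In the workflow described here, committing figures to a Git repository provides that comparison; the sketch below only illustrates the underlying idea with file digests. The names `digest` and `compare_runs`, and the file names in the example, are hypothetical.

```python
import hashlib


def digest(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def compare_runs(previous, current):
    """Classify output files as added, removed, or changed between two runs.

    Both arguments map a file name to the digest of that file's contents.
    """
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "changed": sorted(name for name in shared if previous[name] != current[name]),
    }


if __name__ == "__main__":
    # Placeholder contents stand in for real figure files.
    previous = {"tree.png": digest(b"tree without the extra gene")}
    current = {
        "tree.png": digest(b"tree with the extra gene"),
        "pca.png": digest(b"pca plot"),
    }
    print(compare_runs(previous, current))
```

A digest comparison flags that a figure changed but not how; inspecting the committed images side by side is still needed to interpret the difference.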

