J Am Med Inform Assoc. 2018 Jan 1;25(1):4-12.
doi: 10.1093/jamia/ocx120.

Reproducible Bioconductor workflows using browser-based interactive notebooks and containers

Reem Almugbel et al. J Am Med Inform Assoc.

Abstract

Objective: Bioinformatics publications typically include complex software workflows that are difficult to describe in a manuscript. We describe and demonstrate the use of interactive software notebooks to document and distribute bioinformatics research. We provide a user-friendly tool, BiocImageBuilder, that allows users to easily distribute their bioinformatics protocols through interactive notebooks uploaded to either a GitHub repository or a private server.

Materials and methods: We present four different interactive Jupyter notebooks using R and Bioconductor workflows to infer differential gene expression, analyze cross-platform datasets, process RNA-seq data and KinomeScan data. These interactive notebooks are available on GitHub. The analytical results can be viewed in a browser. Most importantly, the software contents can be executed and modified. This is accomplished using Binder, which runs the notebook inside software containers, thus avoiding the need to install any software and ensuring reproducibility. All the notebooks were produced using custom files generated by BiocImageBuilder.

Results: BiocImageBuilder facilitates the publication of workflows with a point-and-click user interface. We demonstrate that interactive notebooks can be used to disseminate a wide range of bioinformatics analyses. The use of software containers to mirror the original software environment ensures reproducibility of results. Parameters and code can be dynamically modified, allowing for robust verification of published results and encouraging rapid adoption of new methods.

Conclusion: Given the increasing complexity of bioinformatics workflows, we anticipate that these interactive software notebooks will become as necessary and as ubiquitous for documenting software methods as traditional laboratory notebooks have been for documenting bench protocols.

Keywords: automated; bioconductor workflows; containers; data science; reproducibility.

Figures

Figure 1.
Overview of our approach. The author of the Bioconductor workflow uses BiocImageBuilder to generate a Dockerfile that describes the Bioconductor and CRAN packages installed. The Dockerfile and the notebook files are uploaded to a server or GitHub repository. A custom container is then built with the default Linux base image for Bioconductor, dependencies for Jupyter, JupyterHub, and/or Binder, and the Bioconductor packages. For GitHub installations, the Binder server builds the container and provides a link to run the container on its public cluster. JupyterHub provides the same functionality locally or on a private server. Using the container, the end user is able to view the notebook and execute, modify, and save the code on his or her local machine regardless of whether it uses Linux, MacOS, or Windows. In the case where the container is run remotely, no additional installation of software is required on the part of the end user.
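The link that Binder provides for a GitHub repository can be illustrated with a short sketch. It assumes the public mybinder.org service and its `v2/gh/<owner>/<repo>/<ref>` URL layout; the function name, the `example-user/example-repo` repository, and the use of the `filepath` query parameter to open a specific notebook are illustrative assumptions, not details taken from the paper.

```python
from typing import Optional
from urllib.parse import quote


def binder_launch_url(owner: str, repo: str, ref: str = "HEAD",
                      notebook_path: Optional[str] = None) -> str:
    """Build a launch link for a GitHub repository on the public Binder service.

    Assumes the mybinder.org URL layout; a private BinderHub or JupyterHub
    deployment would use its own host name.
    """
    # Percent-encode each path segment so branch names with slashes stay intact.
    segments = [quote(part, safe="") for part in (owner, repo, ref)]
    url = "https://mybinder.org/v2/gh/" + "/".join(segments)
    if notebook_path:
        # Optional: ask Binder to open one notebook instead of the file browser.
        url += "?filepath=" + quote(notebook_path, safe="")
    return url


if __name__ == "__main__":
    # Hypothetical repository, used only to show the shape of the link.
    print(binder_launch_url("example-user", "example-repo",
                            notebook_path="workflow.ipynb"))
```

Anyone who opens such a link gets a running copy of the container in the browser, which is the "no additional installation" case described in the caption.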
Figure 2.
Screenshot of BiocImageBuilder. The user selects from a menu the Bioconductor and CRAN packages required for his or her notebook. BiocImageBuilder then generates the Dockerfile describing a minimal Linux container that contains these packages. The Dockerfile can be uploaded to GitHub, where it can be viewed interactively using Binder.
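The kind of Dockerfile described here can be sketched as follows. This is a minimal illustration, not BiocImageBuilder's actual output: the function names, the base image tag `bioconductor/bioconductor_docker:devel`, and the package lists are assumptions, and the Jupyter and R kernel setup that a real notebook image needs is omitted.

```python
import re
from typing import Sequence

# R package names start with a letter and contain letters, digits, and dots.
_PACKAGE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9.]*$")


def _r_vector(packages: Sequence[str]) -> str:
    """Render package names as an R character vector, e.g. c('limma', 'edgeR')."""
    for name in packages:
        if not _PACKAGE_NAME.match(name):
            raise ValueError(f"not a valid R package name: {name!r}")
    return "c(" + ", ".join("'" + name + "'" for name in packages) + ")"


def render_dockerfile(bioc_packages: Sequence[str],
                      cran_packages: Sequence[str],
                      base_image: str = "bioconductor/bioconductor_docker:devel") -> str:
    """Return Dockerfile text that installs the selected packages.

    The default base image is an assumption for illustration; it is expected
    to ship with R and BiocManager already installed.
    """
    lines = ["FROM " + base_image]
    if cran_packages:
        cran_call = ("install.packages(" + _r_vector(cran_packages)
                     + ", repos='https://cloud.r-project.org')")
        lines.append('RUN R -e "' + cran_call + '"')
    if bioc_packages:
        bioc_call = "BiocManager::install(" + _r_vector(bioc_packages) + ", ask=FALSE)"
        lines.append('RUN R -e "' + bioc_call + '"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example selection, standing in for the choices made in the menu.
    print(render_dockerfile(["limma", "edgeR"], ["ggplot2"]))
```

Keeping the package selection in a generated Dockerfile means the environment is rebuilt from an explicit list, which is what allows Binder or a private server to recreate it.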
