BMC Bioinformatics. 2018 Oct 15;19(Suppl 10):349.
doi: 10.1186/s12859-018-2296-x.

Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines

Neha Kulkarni et al. BMC Bioinformatics.

Abstract

Background: Reproducibility of research is a key element of modern science and is mandatory for any industrial application. It represents the ability to replicate an experiment independently of the location and the operator. A study can therefore be considered reproducible only if all the data used are available and the computational analysis workflow is clearly described. However, for a complex bioinformatics analysis, the raw data and the list of tools used in the workflow may not be enough to guarantee that the results can be reproduced. Indeed, different releases of the same tools and/or of the system libraries those tools rely on might lead to subtle reproducibility issues.
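The version-drift problem described above can be illustrated with a minimal sketch. It assumes that the tool and library versions used in an analysis were recorded in a simple name-to-version manifest; the manifest format, the tool names, and the function name `diff_manifests` are hypothetical and are not part of RBP.

```python
def diff_manifests(recorded, current):
    """Return {name: (recorded_version, current_version)} for every mismatch.

    A name missing from one manifest is reported with None on that side.
    """
    drift = {}
    for name in sorted(set(recorded) | set(current)):
        old = recorded.get(name)
        new = current.get(name)
        if old != new:
            drift[name] = (old, new)
    return drift


# Hypothetical manifests: versions recorded at publication time versus
# versions found on the machine where the analysis is being repeated.
recorded = {"aligner": "2.5.3", "libz": "1.2.8", "counter": "1.0.0"}
current = {"aligner": "2.5.3", "libz": "1.2.11", "counter": "1.0.0"}

drift = diff_manifests(recorded, current)
print(drift)
```

In this example only the system library differs, which is exactly the kind of mismatch that a list of tools alone would not reveal.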

Results: To address this challenge, we established the Reproducible Bioinformatics Project (RBP), a non-profit, open-source project whose aim is to provide a schema and an infrastructure, based on Docker images and R packages, that deliver reproducible results in bioinformatics. One or more Docker images are defined for a workflow (typically one for each task), while the workflow implementation is handled via R functions embedded in a package available in a GitHub repository. Thus, a bioinformatician participating in the project first has to integrate her/his workflow modules into Docker image(s), building on an Ubuntu Docker image developed ad hoc by RBP to make this task easier. Secondly, the workflow must be implemented in R according to an R skeleton function made available by RBP, which guarantees homogeneity and reusability among different RBP functions. Moreover, she/he has to provide an R vignette explaining the package functionality, together with an example dataset that users can run to build confidence in the workflow.
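The general pattern of a wrapper function that runs one workflow task inside a pinned Docker image can be sketched as follows. This is an illustration only: RBP implements its wrappers as R functions, and the image name, tag, script name, and function name `build_docker_command` below are hypothetical. The sketch builds the command without executing it, so it does not require Docker to be installed.

```python
import shlex


def build_docker_command(image, tag, data_dir, step_args):
    """Build a `docker run` command for one workflow task.

    The image tag must be explicit, so that the same tool and library
    versions are used every time the task is run.
    """
    if not tag or tag == "latest":
        raise ValueError("pin an explicit image tag for reproducibility")
    return [
        "docker", "run", "--rm",      # remove the container when it exits
        "-v", f"{data_dir}:/data",    # mount the host data folder at /data
        f"{image}:{tag}",
        *step_args,
    ]


# Hypothetical task: an alignment step run on a folder of fastq files.
cmd = build_docker_command(
    "example/rnaseq-align", "1.0.0", "/home/user/fastq",
    ["run_step.sh", "/data/sample.fastq"],
)
print(shlex.join(cmd))
```

Rejecting an unpinned tag is a design choice of this sketch: a floating tag such as `latest` could resolve to different tool releases over time, which is the situation the project aims to avoid.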

Conclusions: The Reproducible Bioinformatics Project provides a general schema and an infrastructure to distribute robust and reproducible workflows. It thus guarantees that end users can consistently repeat any analysis, independently of the UNIX-like architecture used.

Keywords: Chromatin immunoprecipitation sequencing; Community; Docker; Reproducible research; Single nucleotide variants; Whole transcriptome sequencing; microRNA sequencing.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Reproducible Bioinformatics Project structure
Fig. 2
Workflows available in the stable branch of docker4seq. a Whole transcriptome sequencing workflow, b ChIP sequencing workflow, and c miRNA sequencing workflow. The names followed by parentheses are the docker4seq functions used to execute the analysis steps. Black indicates elements shared by more than one workflow
Fig. 3
Variant calling workflows under refinement in the development branch of docker4seq. a SNV calling in DNA workflow. The function snvPreprocessing requires that users provide their own copy of the GATK software, because of Broad Institute license restrictions. This function returns a sorted bam file with duplicates marked, after GATK indel realignment and quality recalibration. b Data preprocessing for samples derived from Patient Derived Xenografts (PDX). The xenome function discriminates between the mouse host reads and the human tumor reads; DNA or RNA SNV calling workflows can then be applied. c SNV calling in RNA workflow. The function star2steps generates a sorted bam in which duplicates are marked, and which is processed by opossum to remove intronic regions and merge overlapping reads. The names followed by parentheses are the docker4seq functions used to execute the analysis steps. Black indicates elements shared by more than one workflow
Fig. 4
Variant calling workflows under development in the development branch of docker4seq. a Somatic SNV detection using GATK MUTECT 1 or 2. b Platypus-based joint mutation caller. Dashed blocks are not yet implemented
Fig. 5
sncRNA workflow. The sncRNA pipeline starts from a reference composed of all sncRNAs shorter than 80 bp. Two types of scripts are then used: one dedicated to the detection of known and novel microRNAs, the other focused on sncRNAs
Fig. 6
HashClone pipeline. The HashClone strategy is organized in three steps. The first step (red box) detects k-mers in all patients’ samples. The second step (green box) focuses on generating sequence signatures, leading to the identification of the set of putative clones present in each patient’s sample. The third step (blue box) characterizes and evaluates the cancer clones
