Use of application containers and workflows for genomic data analysis

J Pathol Inform. 2016 Dec 30;7:53. doi: 10.4103/2153-3539.197197. eCollection 2016.

Wade L Schulz¹, Thomas J S Durant¹, Alexa J Siddon², Richard Torres¹

Affiliations

¹ Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT, USA.
² Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT, USA; Pathology and Laboratory Medicine Service, VA Connecticut Healthcare System, West Haven, CT, USA.

PMID: 28163975
PMCID: PMC5248400
DOI: 10.4103/2153-3539.197197

Wade L Schulz et al. J Pathol Inform. 2016.

Abstract

Background: The rapid acquisition of biological data and the development of computationally intensive analyses have led to a need for novel approaches to software deployment. In particular, the complexity of common analytic tools for genomics makes them difficult to deploy and decreases the reproducibility of computational experiments.

Methods: Recent application virtualization technologies, such as Docker, let developers and bioinformaticians isolate these applications and deploy secure, scalable platforms that have the potential to dramatically increase the efficiency of big data processing.

Results: While limitations exist, this study demonstrates a successful implementation of a pipeline with several discrete software applications for the analysis of next-generation sequencing (NGS) data.

Conclusions: With this approach, we significantly reduced the amount of time needed to perform clonal analysis from NGS data in acute myeloid leukemia.

Keywords: Big data; bioinformatics workflow; containerization; genomics.

Conflict of interest statement

There are no conflicts of interest.

Figures

**Figure 1**
Serial workflow and architecture to download Cancer Genomics Hub data. To obtain next-generation sequencing data from Cancer Genomics Hub, the cgDownload utility was used to transfer aligned whole-genome and whole-exome sequencing data. The SomaticSniper utility was then used to identify somatic variants, and tumor clonality was predicted with SciClone. These utilities were all manually configured on a server running CentOS 6.7

**Figure 2**
Comparison of standard application architecture and containerized architecture for clonal analysis. (a) When deployed in a virtual server, the analysis workflow was installed on CentOS 6.7 and had to be run serially due to limitations in software parallelization and local resources. Applications were launched manually in sequence to download NGS data, identify variants, and predict tumor clonality. (b) When configured in Docker containers and driven by a workflow manager, applications were launched automatically and could scale based on available system resources. Each application was configured on its native operating system architecture within the container, as indicated in the figure

**Figure 3**
Disk throughput and processor efficiency of Docker containers. (a) The time needed to write a one-gigabyte file with the dd utility was similar in both a virtual machine and within a Docker container on the same host. (b) The calculation of 10,000 primes with the sysbench utility showed similar performance in a virtual machine and a Docker container on the same host

**Figure 4**
Illustration of parallelization improvements with a workflow-driven container architecture. (a) When performed serially, the download (white bars) and analysis (shaded bars) of a single pair of tumor and germline sequence on local hardware took approximately 4 h (bars drawn to scale). (b) When parallelized with a workflow manager and Docker containers, multiple specimens could be processed simultaneously to take advantage of all system resources, including network, memory, and processor capacity
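
The contrast between panels (a) and (b) of Figure 4 can be sketched in Python. This is only an illustration under stated assumptions, not the authors' workflow manager or container configuration: the three stage functions are hypothetical stand-ins for the download, variant-calling, and clonality steps named in Figure 1, and they only rename files rather than invoking cgDownload, SomaticSniper, SciClone, or Docker.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three stages named in Figure 1.
# The real tools (cgDownload, SomaticSniper, SciClone) are NOT invoked here.
def download(specimen_id: str) -> str:
    return f"{specimen_id}.bam"

def call_variants(bam: str) -> str:
    return bam.replace(".bam", ".vcf")

def predict_clonality(vcf: str) -> str:
    return vcf.replace(".vcf", ".clusters")

def process_specimen(specimen_id: str) -> str:
    """Run the three stages in order for one tumor/germline pair."""
    return predict_clonality(call_variants(download(specimen_id)))

def run_serial(specimens):
    # Panel (a): one specimen at a time.
    return [process_specimen(s) for s in specimens]

def run_parallel(specimens, max_workers=4):
    # Panel (b): stages stay ordered within a specimen,
    # while several specimens are processed concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_specimen, specimens))

if __name__ == "__main__":
    ids = ["specimen_A", "specimen_B", "specimen_C"]
    print(run_parallel(ids))
```

Both functions return the same results in the same order; only the scheduling differs. In a real deployment each stage would run in its own container, and the worker count would be tuned to the available network, memory, and processor capacity.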

Cited by

Hot-starting software containers for STAR aligner.
Zhang P, Hung LH, Lloyd W, Yeung KY. Gigascience. 2018 Aug 1;7(8):giy092. doi: 10.1093/gigascience/giy092. PMID: 30085034.
Reproducible Bioconductor workflows using browser-based interactive notebooks and containers.
Almugbel R, Hung LH, Hu J, Almutairy A, Ortogero N, Tamta Y, Yeung KY. J Am Med Inform Assoc. 2018 Jan 1;25(1):4-12. doi: 10.1093/jamia/ocx120. PMID: 29092073.
PGSXplorer: an integrated nextflow pipeline for comprehensive quality control and polygenic score model development.
Yaraş T, Oktay Y, Karakülah G. PeerJ. 2025 Feb 12;13:e18973. doi: 10.7717/peerj.18973. eCollection 2025. PMID: 39959831.
DockerBIO: web application for efficient use of bioinformatics Docker images.
Kwon C, Kim J, Ahn J. PeerJ. 2018 Nov 27;6:e5954. doi: 10.7717/peerj.5954. eCollection 2018. PMID: 30515360.
A complete pedigree-based graph workflow for rare candidate variant analysis.
Markello C, Huang C, Rodriguez A, Carroll A, Chang PC, Eizenga J, Markello T, Haussler D, Paten B. Genome Res. 2022 May;32(5):893-903. doi: 10.1101/gr.276387.121. Epub 2022 Apr 28. PMID: 35483961.

Grants and funding

UL1 TR001863/TR/NCATS NIH HHS/United States
