BMC Med Genomics. 2015 Oct 15:8:64.
doi: 10.1186/s12920-015-0134-9.

Scalable and cost-effective NGS genotyping in the cloud

Yassine Souilmi et al. BMC Med Genomics. 2015.

Abstract

Background: While next-generation sequencing (NGS) costs have plummeted in recent years, the cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until whole genome sequencing data can be robustly, routinely, and accurately rendered into medically actionable reports within hours and at a cost in the tens of dollars.

Results: We take a step towards addressing this challenge by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows, making optimal use of high-performance compute clusters. Here we show that the Amazon Web Services (AWS) implementation of GenomeKey via COSMOS provides fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets.

Conclusions: Our systematic benchmarking reveals important new insights and considerations for achieving clinical turnaround of whole genome analysis through optimization and workflow management, including strategic batching of individual genomes and efficient cluster resource configuration.

Figures

Fig. 1
GenomeKey workflow and overall benchmarking study design. a GenomeKey workflow implements the GATK 3 best practices for genomic variant calling. Each arrow represents a stage of the workflow, and the level of parallelization for each stage is described in the Methods section under “Workflow”. b Deployment of the workflow on the Amazon Web Services Elastic Compute Cloud (EC2) infrastructure using the COSMOS workflow management engine
Fig. 2
GenomeKey scalability. The GenomeKey workflow scales efficiently with an increasing number of genomes. a Wall time and (b) cost as a function of the number of genomes, compared to a linear extrapolation from a single genome. The workflow also scales efficiently with an increasing number of exomes on different GlusterFS configurations. The blue curve represents the 1-, 3-, 5- and 10-exome runs performed on a cluster with one GlusterFS brick; the yellow curve represents the scalability on a cluster with four GlusterFS bricks. c Wall time and (d) cost as a function of exome batch size, compared to a linear extrapolation from a single exome
Fig. 3
Cluster resource usage. Cluster resources are utilized more efficiently as batch size increases. When the number of exomes increases from (a) 5 exomes to (b) 10 exomes, overall cluster CPU usage (shown as the brown “Total” line) is higher across the entire runtime. Percent CPU usage for each job across the entire 20-node cluster was summed within 5-min “wall time” windows and then scaled by the total number of cores (20 nodes × 32 cores/node = 640 cores) to quantify the overall system utilization. CPU usage for jobs not fully contained within a 5-min window was pro-rated according to how much they overlapped with it. The contribution of each stage to the total (brown line) as a function of time further illustrates the parallelization
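
The utilization measure described in the Fig. 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Job` record, its field names, and the `cluster_utilization` function are hypothetical, and each job is assumed to report a single average percent CPU (100 = one fully busy core) over its lifetime. Per-job usage is pro-rated by the fraction of each 5-min window the job overlaps, summed per window, and divided by the total core count (nodes × cores per node).

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions for illustration."""
    start: float        # seconds since the run began
    end: float          # seconds since the run began
    cpu_percent: float  # average percent CPU, where 100 = one fully busy core


def cluster_utilization(jobs, nodes, cores_per_node, window=300.0):
    """Return percent of total cluster CPU capacity used in each window.

    Each job's percent CPU is pro-rated by how much of the window the job
    overlaps, summed over jobs, then scaled by the total number of cores.
    """
    total_cores = nodes * cores_per_node
    if not jobs:
        return []
    horizon = max(job.end for job in jobs)
    n_windows = int(-(-horizon // window))  # ceiling division
    usage = [0.0] * n_windows
    for job in jobs:
        first = int(job.start // window)
        last = min(n_windows - 1, int(job.end // window))
        for w in range(first, last + 1):
            w_start = w * window
            w_end = w_start + window
            overlap = min(job.end, w_end) - max(job.start, w_start)
            if overlap > 0:
                usage[w] += job.cpu_percent * overlap / window
    return [u / total_cores for u in usage]


if __name__ == "__main__":
    # Toy input, not data from the study: two jobs on one 4-core node.
    demo = [Job(0, 300, 100), Job(150, 450, 200)]
    print(cluster_utilization(demo, nodes=1, cores_per_node=4))
```

With the toy input, the second job spans two windows and contributes half of its CPU usage to each, which is the pro-rating step the caption describes.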

