Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics

Benjamin J Kelly, James R Fitch, Yangqiu Hu, Donald J Corsmeier, Huachun Zhong, Amy N Wetzel, Russell D Nordquist, David L Newsom, Peter White^{1,2}

Affiliations

¹ Center for Microbial Pathogenesis, The Research Institute at Nationwide Children's Hospital, 700 Children's Drive, Columbus 43205, OH, USA.
² Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, Ohio, USA.

PMID: 25600152
PMCID: PMC4333267
DOI: 10.1186/s13059-014-0577-x

Benjamin J Kelly et al. Genome Biol. 2015 Jan 20;16(1):6. doi: 10.1186/s13059-014-0577-x.

Abstract

While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, are complex to implement and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.

Figures

**Figure 1**
**Churchill optimizes load balancing, resulting in improved resource utilization and faster run times.** Three different strategies for parallelization of whole genome sequencing secondary data analysis were compared: balanced (utilized by Churchill), chromosomal (utilized by HugeSeq) and scatter-gather (utilized by GATK-Queue). The resource utilization, timing and scalability of the three pipelines were assessed using sequence data for a single human genome sequence dataset (30× coverage). **(A)** CPU utilization was monitored throughout the analysis process and demonstrated that Churchill improved resource utilization (92%) when compared with HugeSeq (46%) and GATK-Queue (30%). **(B)** Analysis timing metrics generated with 8 to 48 cores demonstrated that Churchill (green) is twice as fast as HugeSeq (red), four times faster than GATK-Queue (blue), and 10 times faster than a naïve serial implementation (yellow) with in-built multithreading enabled. **(C)** Churchill scales much better than the alternatives; the speed differential between Churchill and alternatives increases as more cores in a given compute node are used.
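
The load-balancing effect described in this caption can be illustrated with a toy scheduling model. The sketch below is not Churchill's implementation: it assumes that the work for a region is proportional to its length, uses rounded GRCh37 chromosome lengths, and the function names `makespan` and `utilization` are hypothetical. It compares one task per chromosome with equal-sized regions, one per core.

```python
import heapq

# Approximate GRCh37 chromosome lengths in megabases (rounded, illustrative only).
CHROM_MB = {
    "1": 249, "2": 243, "3": 198, "4": 191, "5": 181, "6": 171,
    "7": 159, "8": 146, "9": 141, "10": 136, "11": 135, "12": 134,
    "13": 115, "14": 107, "15": 103, "16": 90, "17": 81, "18": 78,
    "19": 59, "20": 63, "21": 48, "22": 51, "X": 155, "Y": 59,
}


def makespan(task_sizes, workers):
    """Finish time when tasks go, largest first, to the least-loaded worker."""
    loads = [0.0] * workers  # a list of equal values is already a valid heap
    for size in sorted(task_sizes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size)
    return max(loads)


def utilization(task_sizes, workers):
    """Fraction of total worker time spent busy (1.0 means perfectly balanced)."""
    return sum(task_sizes) / (workers * makespan(task_sizes, workers))


if __name__ == "__main__":
    total = sum(CHROM_MB.values())
    for cores in (8, 24, 48):
        by_chromosome = utilization(list(CHROM_MB.values()), cores)
        balanced = utilization([total / cores] * cores, cores)
        print(f"{cores} cores: chromosomal {by_chromosome:.2f}, balanced {balanced:.2f}")
```

In this model a chromosomal split can never finish sooner than the time needed for chromosome 1, and with more cores than chromosomes the extra cores sit idle, whereas equal-sized regions keep every core busy. The measured utilization figures in the caption come from real pipelines and will not match this simplified model.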

**Figure 2**
**Churchill scales efficiently, enabling complete secondary analysis to be achieved in less than two hours.** The capability of Churchill, GATK-Queue and HugeSeq to scale analysis beyond a single compute node was evaluated. **(A)** Fold speedup as a function of the number of cores used was assessed across a cluster of four Dell® R815 servers with Churchill (green), GATK-Queue (blue), HugeSeq (red) and serial analysis (yellow). For comparison, the linear speedup (grey) and that predicted by Amdahl’s law (purple) assuming a one-hour sequential time are also included [11]. Churchill’s scalability closely matches that predicted by Amdahl’s law, achieving in excess of a 13-fold speedup between 8 and 192 cores. In contrast, both HugeSeq and GATK-Queue showed modest improvements in speed between 8 and 24 cores (2-fold), with a maximal 3-fold speedup being achieved with 48 cores, and no additional increase in speed beyond 48 cores. **(B)** Timing results for different steps of the Churchill pipeline were assessed with increasing numbers of cores. Complete human genome analysis was achieved in three hours using an in-house cluster with 192 cores and in 100 minutes at the Ohio Supercomputer Center (Glenn Cluster utilizing 700 cores). Results were confirmed using both the Pittsburgh Supercomputing Center and Amazon Web Services EC2.
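
The Amdahl's law reference curve in panel A can be sketched in a few lines. This is a minimal illustration under one reading of the caption, namely that "one-hour sequential time" means a fixed one-hour portion that cannot be parallelized. The `parallel_hours` value used in the example is a hypothetical placeholder and is not taken from the paper.

```python
def amdahl_runtime(serial_hours, parallel_hours, cores):
    """Wall-clock time when only the parallel portion divides across cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return serial_hours + parallel_hours / cores


def relative_speedup(serial_hours, parallel_hours, base_cores, cores):
    """Speedup of running on `cores` relative to running on `base_cores`."""
    baseline = amdahl_runtime(serial_hours, parallel_hours, base_cores)
    scaled = amdahl_runtime(serial_hours, parallel_hours, cores)
    return baseline / scaled


if __name__ == "__main__":
    # Hypothetical workload: 1 h serial plus 96 core-hours of parallel work.
    for cores in (8, 24, 48, 96, 192):
        speedup = relative_speedup(1.0, 96.0, 8, cores)
        print(f"{cores:>3} cores: {speedup:.2f}x relative to 8 cores")
```

With no serial portion, going from 8 to 192 cores would give the linear 24-fold speedup. Any fixed serial time lowers that ceiling, because the runtime can never drop below `serial_hours` however many cores are added.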

**Figure 3**
**The performance of Churchill does not come at the sacrifice of data quality.** The final VCF output of Churchill (green), GATK-Queue (blue) and HugeSeq (red) was compared and evaluated against the National Institute of Standards and Technology (NIST) benchmark SNP and indel genotype calls generated by the Genome in a Bottle Consortium (GIAB) [13]. The Venn diagram shows a high degree of concordance between the three pipelines. Churchill identified the highest number of validated variants from the approximately 2.9 million calls in the GIAB dataset, for both SNPs (99.9%) and indels (93.5%), and had the highest overall sensitivity (99.7%) and accuracy (99.9988%). The Youden index (or J statistic), a function of sensitivity (true positive rate) and specificity (true negative rate), is a commonly used measure of overall diagnostic effectiveness [14].
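
The metrics named in this caption all derive from a confusion matrix, and the Youden index is defined as J = sensitivity + specificity − 1. The sketch below shows those definitions with hypothetical counts. How true negatives were counted for the benchmark comparison is not stated in this excerpt, so the numbers here are placeholders and do not reproduce the reported values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of calls compared against a benchmark truth set."""
    tp: int  # variant called, variant in benchmark
    fp: int  # variant called, not in benchmark
    tn: int  # no call, no variant in benchmark
    fn: int  # no call, variant in benchmark


def sensitivity(c):
    """True positive rate."""
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """True negative rate."""
    return c.tn / (c.tn + c.fp)


def accuracy(c):
    """Fraction of all positions classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def youden_j(c):
    """Youden index: 0 for a chance-level classifier, 1 for a perfect one."""
    return sensitivity(c) + specificity(c) - 1.0


if __name__ == "__main__":
    counts = ConfusionCounts(tp=90, fp=20, tn=80, fn=10)  # hypothetical counts
    print(sensitivity(counts), specificity(counts), accuracy(counts), youden_j(counts))
```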

**Figure 4**
**Churchill enables rapid secondary analysis and variant calling with GATK HaplotypeCaller using cloud computing resources.** Analysis of raw sequence data for a single human genome sequence dataset (30× coverage) was compared using Churchill and bcbio-nextgen, with both pipelines utilizing BWA-MEM for alignment and GATK HaplotypeCaller for variant detection and genotyping. **(A)** CPU utilization on a single r3.8xlarge AWS EC2 instance (32 cores) was monitored throughout the analysis process and demonstrated that Churchill improved resource utilization (94%) when compared with bcbio-nextgen (57%), enabling the entire analysis to be completed in under 12 hours with a single instance. **(B)** Unlike bcbio-nextgen, Churchill enables all steps of the analysis process to be efficiently scaled across multiple compute nodes, resulting in significantly reduced run times. With 16 AWS EC2 instances the entire analysis could be completed in 104 minutes, with the GATK HaplotypeCaller variant calling and genotyping stage taking only 24 minutes of the total run time.

**Figure 5**
**Churchill enables population-scale whole human genome sequence analysis.** Churchill was used to analyze 1,088 of the low-coverage whole-genome samples that were included in ‘phase 1’ of the 1000 Genomes Project (1KG). Raw sequence data for the entire population were used to generate a single multi-sample VCF in 7 days using 400 AWS EC2 instances (cc2.8xlarge spot instances). The resulting Churchill filtered VCF (green) was then compared to the 1KG Consortium’s VCF (red), with Churchill calling 41.2 million variants and the 1KG VCF file containing 39.7 million. The two VCF file sets had a total of 34.4 million variant sites in common. **(A)** There were 33.2 million SNPs called in common, with validation rates against known SNPs being highly similar: 52.8% (Churchill) and 52.4% (1KG). **(B)** Churchill called three-fold more indels, of which 19.5% were known compared with 12.5% in the 1KG indel set. The indels unique to Churchill have a seven-fold higher rate of validation with known variants than those unique to 1KG. **(C)** Minor allele frequencies were compared for the 34.3 million variants with the same minor allele and a density binned scatter plot was produced (scaled from low (light blue) to high (purple) density frequencies). The results from Churchill and the original 1KG analysis demonstrated highly concordant minor allele frequencies (R² = 0.9978, P-value <2.2e-16).
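
The concordance statistic in panel C can be illustrated with a small helper. This sketch assumes that the reported R² is the squared Pearson correlation between the two sets of minor allele frequencies, which the caption does not state explicitly. The input values are hypothetical and the function name `r_squared` is illustrative.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical minor allele frequencies for the same sites in two call sets.
    maf_a = [0.01, 0.05, 0.12, 0.30, 0.48]
    maf_b = [0.01, 0.06, 0.11, 0.31, 0.47]
    print(f"R^2 = {r_squared(maf_a, maf_b):.4f}")
```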

References

1. Gonzaga-Jauregui C, Lupski JR, Gibbs RA. Human genome sequencing in health and disease. Annu Rev Med. 2012;63:35–61. doi: 10.1146/annurev-med-051010-162644.
2. Mardis ER. A decade’s perspective on DNA sequencing technology. Nature. 2011;470:198–203. doi: 10.1038/nature09796.
3. The Boston Children’s Hospital CLARITY Challenge Consortium. An international effort towards developing standards for best practices in analysis, interpretation and reporting of clinical genome sequencing results in the CLARITY Challenge. Genome Biol. 2014;15:R53. doi: 10.1186/gb-2014-15-3-r53.
4. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38:1767–71. doi: 10.1093/nar/gkp1137.
5. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–8. doi: 10.1038/ng.806.
