An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data

Goo Jun¹, Mary Kate Wing², Gonçalo R Abecasis², Hyun Min Kang²

Affiliations

¹ Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA; Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109, USA.
² Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109, USA.

PMID: 25883319
PMCID: PMC4448687
DOI: 10.1101/gr.176552.114

An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data

Goo Jun et al. Genome Res. 2015 Jun.

. 2015 Jun;25(6):918-25.

doi: 10.1101/gr.176552.114. Epub 2015 Apr 16.

Authors

Goo Jun¹, Mary Kate Wing², Gonçalo R Abecasis², Hyun Min Kang²

Affiliations

¹ Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA; Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109, USA.
² Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109, USA.

PMID: 25883319
PMCID: PMC4448687
DOI: 10.1101/gr.176552.114

Abstract

The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires less computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.

PubMed Disclaimer

Figures

**Figure 1.**
Outline of GotCloud variant calling pipeline.

**Figure 2.**
Computational costs of GATK UnifiedGenotyper and GotCloud pipelines. (A) Runtime estimated for whole-genome (6×) and whole-exome (60×) sequence data running with 40 parallel sessions on a four-node minicluster with 48 physical CPU cores. For GATK, runtimes for 1000 samples are extrapolated from analyses of a single 5-Mb block of Chromosome 20. For all other analyses, no extrapolation was used. (B) Peak memory usage estimates averaged over Chromosome 20 chunks.

**Figure 3.**
Variant discovery sensitivity comparison of GotCloud and GATK using HumanExome BeadChip excluding the SNPs contained in Omni2.5 array, because Omni2.5 variants are used to train variant filters in GotCloud and GATK. GotCloud results are shown for unfiltered (raw) and SVM-filtered sets, and GATK results are shown for unfiltered and VQSR-filtered sets, across low-coverage genome (A) and exome data (B).

**Figure 4.**
Comparison of SVM filtering with hard filtering based on a single feature. (A) Ts/Tv of filtered-in (PASS) and filtered-out (FAIL) variants using different filters. Variants are ordered by a single variant feature and a fixed fraction of variants (8%) are filtered out to match the variant counts with the default SVM filter. Absolute values are used for StrandBias correlation and CycleBias correlation. (B) Percentage of filtered-out HumanExome BeadChip (Omni2.5) variants among those that are polymorphic in the array genotypes.

**Figure 5.**
Impact of model-transfer filtering on the variant quality. The vertical axis represents Ts/Tv for novel SNPs for model-transferred SVM and self-trained SVM filters. Ts/Tv for known SNPs are ∼3.5, which is higher than novel SNPs because known SNPs contain a larger fraction of synonymous SNPs. The horizontal axis represents the size of targeted regions randomly selected from 500 whole-exome sequences. The transferred model is trained on nonoverlapping 500 exome data.

**Figure 6.**
Comparison of known and novel Ts/Tv with and without trimming overlapping reads for 1000 low-coverage Chromosome 20 sequences from the 1000 Genomes Project. Overlapping fragments lowers novel Ts/Tv and the effect is more eminent in the singletons (with allele count of one).

**Figure 7.**
Nonreference genotype concordance for low-pass genome data calculated using Omni2.5 array genotypes. The haplotype-aware refinement steps significantly improve genotype accuracy, especially with larger sample sizes.

See this image and copyright information in PMC

References

1. The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65. - PMC - PubMed
1. Amari S, Wu S. 1999. Improving support vector machine classifiers by modifying kernel functions. Neural Netw 12: 783–789. - PubMed
1. Browning BL, Yu Z. 2009. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am J Hum Genet 85: 847–861. - PMC - PubMed
1. Chang CC, Lin CJ. 2011. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2: 27:1–27:27.
1. Cortes C, Vapnik V. 1995. Support-vector networks. Mach Learn 20: 273–297.

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations
Molecular Biology Databases
- NIAID Data Ecosystem - Find datasets on Infectious and Immune-mediated Diseases

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data

Affiliations

An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Molecular Biology Databases