BMC Bioinformatics. 2014 May 2;15:125.
doi: 10.1186/1471-2105-15-125.

Effective filtering strategies to improve data quality from population-based whole exome sequencing studies

Andrew R Carson et al. BMC Bioinformatics. 2014.

Abstract

Background: Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK's recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone.

Results: The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%; improve the percent of discordant genotypes removed from 10.5% to 69.5%; and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples based on different target capture and sequencing chemistry protocols results in a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies and is used to estimate common variant genotypes to generate additional markers for association analyses. As such, we demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes.
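The two genotype-level metrics quoted above (percent concordance with array genotypes and percent of discordant genotypes removed) can be computed roughly as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the encoding of genotypes as alternate-allele counts (0/1/2, with `None` for a missing or filtered call), and the toy data are all hypothetical.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # alt-allele count 0/1/2; None = missing or filtered out


def concordance(seq: Sequence[Genotype], array: Sequence[Genotype]) -> float:
    """Percent of genotypes called in both data sets that agree."""
    pairs = [(s, a) for s, a in zip(seq, array) if s is not None and a is not None]
    if not pairs:
        raise ValueError("no genotypes called in both data sets")
    matches = sum(1 for s, a in pairs if s == a)
    return 100.0 * matches / len(pairs)


def discordant_removed(
    before: Sequence[Genotype],
    after: Sequence[Genotype],
    array: Sequence[Genotype],
) -> float:
    """Percent of originally discordant genotypes that a filter set to missing."""
    discordant = [
        i
        for i, (s, a) in enumerate(zip(before, array))
        if s is not None and a is not None and s != a
    ]
    if not discordant:
        raise ValueError("no discordant genotypes before filtering")
    removed = sum(1 for i in discordant if after[i] is None)
    return 100.0 * removed / len(discordant)


# Toy data (hypothetical): one sample, six sites.
array_gt = [0, 1, 2, 1, 0, 2]
unfiltered = [0, 1, 1, 1, 1, 2]    # discordant at positions 2 and 4
filtered = [0, 1, None, 1, 1, 2]   # a filter removed the call at position 2

print(concordance(unfiltered, array_gt))
print(concordance(filtered, array_gt))
print(discordant_removed(unfiltered, filtered, array_gt))
```

In this sketch, sites where either data set lacks a call are excluded from the concordance denominator, which is one common convention; the study may have handled missing calls differently.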

Conclusions: The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared to data processed through standard practices, these strategies result in substantially higher quality data for common and rare association analyses.

Figures

Figure 1
Summary of methods and improved data quality from genotype and variant filters. A) Left panel illustrates the standard filtering method (left side) compared to the proposed genotype and variant filtering method (right side) for sequencing data. Right panel illustrates the method used for genotype and variant filtering of imputed data. The quality metrics resulting from standard filtering (blue box), proposed genotype and variant filters (orange boxes), and a combination of these methods (green box) are compared to the quality of the unfiltered data (grey boxes). B) Quantitative comparisons of quality improvement are depicted for both sequencing and imputation filters at both genotype (% of discordant genotypes removed and % concordance) and variant (Ti/Tv and R2) levels. Box colors match the boxes in A).
Figure 2
Improved concordance, sensitivity and specificity of WES data using genotype filters. Plots illustrate the non-reference concordance and sensitivity versus specificity between array and sequencing genotypes for 10 samples. A) The percent of non-reference discordant calls removed is plotted versus the percent of non-reference concordant calls retained at increasing quality thresholds. B) Sensitivity versus specificity is plotted at increasing quality thresholds. For A) and B), the blue line represents changing DP thresholds and the red line represents changing GQ thresholds. Chosen filter thresholds (DP ≥ 8 and GQ ≥ 20) are indicated by points on these lines. C) Summarizes the effect that the chosen genotype filters (both DP and GQ) have on non-reference concordant and discordant genotype calls with and without the VQSR filter.
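A genotype filter using the thresholds named in this caption (DP ≥ 8 and GQ ≥ 20) might look like the sketch below. The `Call` record and `apply_genotype_filter` function are hypothetical names introduced for illustration, and requiring both thresholds at once is an assumption based on the caption's "both DP and GQ"; the authors' actual implementation is not shown in this abstract.

```python
from dataclasses import dataclass
from typing import Optional

MIN_DP = 8   # read-depth threshold from the Figure 2 caption
MIN_GQ = 20  # genotype-quality threshold from the Figure 2 caption


@dataclass
class Call:
    """One sample's call at one site (hypothetical record layout)."""
    genotype: Optional[str]  # e.g. "0/1"; None means no-call
    dp: Optional[int]
    gq: Optional[int]


def apply_genotype_filter(call: Call) -> Call:
    """Set the genotype to missing unless DP >= MIN_DP and GQ >= MIN_GQ."""
    passes = (
        call.genotype is not None
        and call.dp is not None and call.dp >= MIN_DP
        and call.gq is not None and call.gq >= MIN_GQ
    )
    return call if passes else Call(None, call.dp, call.gq)


calls = [
    Call("0/1", 30, 99),  # passes both thresholds
    Call("1/1", 5, 40),   # fails DP
    Call("0/1", 12, 15),  # fails GQ
    Call("0/0", 8, 20),   # exactly at both thresholds, so it passes
]
kept = [apply_genotype_filter(c).genotype for c in calls]
print(kept)
```

Failing genotypes are set to missing rather than dropped, so per-variant statistics such as call rate can still be computed afterwards.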
Figure 3
Improved Ti/Tv ratios in WES data using variant filters. Plots illustrate Ti/Tv improvement at different thresholds of A) average GQ and B) call rate. In these plots, Ti/Tv ratios (blue) for novel (dotted line), known (dashed line) and true (solid line) variants are juxtaposed against the drop in sensitivity (red) as the variant filtering thresholds increase. Chosen thresholds are shown by the red dashed lines. C) The Ti/Tv improvement after each of the variant filtering steps is summarized. In addition, the result from an alternative filtering order, where VQSR is applied prior to the combined variant filters, is also displayed (green circle). *Combined filters refers to HWE, average GQ and call rate filters applied together.
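For readers unfamiliar with the Ti/Tv metric used here, it is the number of transition substitutions (A↔G, C↔T) divided by the number of transversions (all other single-base substitutions). The following is a minimal sketch of that calculation; the function name and toy variant list are hypothetical, and skipping indels and non-ACGT alleles is an assumption of this example.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def titv_ratio(snvs):
    """Ratio of transitions to transversions over (ref, alt) allele pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, ambiguous bases, and records where ref equals alt.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ti / tv


# Toy variant list (hypothetical): three transitions, one transversion, one indel.
variants = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
ratio = titv_ratio(variants)
print(ratio)  # 3.0
```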
Figure 4
Applying a GQ filter improves the quality of imputation results from WES data. Plots illustrate the data quality improvement seen after applying a GQ threshold. A) Plots the average concordance (blue line) improvement between array and sequencing genotypes for 10 samples as the GQ threshold increases. Coupled with this concordance improvement is the average percent of genotypes that remain (red line) with GQ values above that threshold. B) At the GQ > 20 threshold, this plot shows that variants removed (blue) due to loss of all genotypes have generally lower quality (as measured by R2) compared to variants containing at least one genotype (red). Mean values for each distribution are shown by the dotted lines.
