PLoS One. 2014 Jan 29;9(1):e86664.
doi: 10.1371/journal.pone.0086664. eCollection 2014.

Comprehensive analysis to improve the validation rate for single nucleotide variants detected by next-generation sequencing

Mi-Hyun Park et al. PLoS One. 2014.

Abstract

Next-generation sequencing (NGS) has enabled the high-throughput discovery of germline and somatic mutations. However, NGS-based variant detection is still prone to errors, resulting in inaccurate variant calls. Here, we categorized the variants detected by NGS according to total read depth (TD) and SNP quality (SNPQ), and validated 348 selected non-synonymous single nucleotide variants (SNVs) by Sanger sequencing. For variants called with the SAMtools and GATK algorithms, the validation rate was positively correlated with SNPQ but showed no correlation with TD. In addition, common variants called by both programs had a higher validation rate than caller-specific variants. We further examined several parameters to improve the validation rate, and found that strand bias (SB) was a key parameter. SB differed strongly between the variants that passed validation and those that failed, and filtering on SB gave a validation rate of more than 92% (filtering cutoff values: alternate allele forward [AF] ≥ 20 and AF < 80 in SAMtools; SB < -10 in GATK). Moreover, the validation rate increased significantly (up to 97–99%) when variants were filtered with the suggested values of mapping quality (MQ), SNPQ and SB together. This detailed and systematic study provides comprehensive recommendations for improving validation rates, saving time and lowering cost in NGS analyses.
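The strand-bias cutoffs quoted in the abstract can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline. It assumes that the AF percent is the share of alternate-allele reads on the forward strand, computed from the four counts in a SAMtools-style `DP4` annotation (reference forward, reference reverse, alternate forward, alternate reverse), and that the GATK SB value is available as a plain number. The function names are hypothetical.

```python
def alt_forward_percent(dp4):
    """Percent of alternate-allele reads that map to the forward strand.

    dp4 is (ref_fwd, ref_rev, alt_fwd, alt_rev). Deriving AF from DP4 is an
    assumption made for this sketch. Returns None if there are no alternate
    reads, because the percentage is undefined in that case.
    """
    _, _, alt_fwd, alt_rev = dp4
    total_alt = alt_fwd + alt_rev
    if total_alt == 0:
        return None
    return 100.0 * alt_fwd / total_alt


def passes_samtools_sb(dp4):
    """Apply the SAMtools cutoff from the abstract: AF >= 20 and AF < 80."""
    af = alt_forward_percent(dp4)
    return af is not None and 20.0 <= af < 80.0


def passes_gatk_sb(sb):
    """Apply the GATK cutoff from the abstract: SB < -10."""
    return sb < -10.0


if __name__ == "__main__":
    # Toy values for illustration only; they are not data from the study.
    print(passes_samtools_sb((10, 10, 5, 5)))  # balanced alternate reads
    print(passes_samtools_sb((10, 10, 1, 9)))  # alternate reads mostly reverse
    print(passes_gatk_sb(-15.2))
```

The lower AF bound is inclusive and the upper bound is exclusive, matching the cutoffs as stated. The MQ and SNPQ cutoffs are not given in the abstract, so they are left out here.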

Conflict of interest statement

Competing Interests: The authors have the following interests. Hwanseok Rhee and Jung Hoon Park are employed by Macrogen Inc., a company that markets NGS services. There are no further patents, products in development or marketed products to declare. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Figures

Figure 1. Pipelines for calling single nucleotide variants (SNVs). SNVs were called in four sets, based on SAMtools: mpileup (SNP set1 and SNP set3) and GATK: unified genotyper (SNP set2 and SNP set4).
The numbers of reads and SNPs for individual steps are given for one exome-seq data set, generated using a Solexa GAIIx Genome Analyzer.
Figure 2. Distribution of validation rate according to SNP quality (SNPQ) and total read depth (TD).
(A) Validation rate of SNPQ for SAMtools (SNP set1 and SNP set3). (B) Validation rate of SNPQ for GATK (SNP set2 and SNP set4). (C) Validation rate of TD for SNP set1–4.
Figure 3. Diagram and validation rate of common variants.
(A) The diagram of common variants among the four types of SNP sets. (B) The validation rates of common variants and caller-specific variants.
Figure 4. Evaluation of analysis parameters for improving validation rates.
(A) Distribution of validation rates according to genotype quality (GQ) values. (B) Distribution of validation rates according to mapping quality (MQ) values. (C) Distribution of validation rates according to alternate allele forward (AF) percent for SAMtools (SNP set1 and SNP set3). (D) Distribution of validation rates according to strand bias (SB) values for GATK (SNP set2 and SNP set4).
Figure 5. Validation rates according to suggested cutoff values of parameters are shown in table (A) and graph (B) form using the SAMtools algorithm after realignment and recalibration.
A+B+C: filtered variants together with suggested values of SNPQ, GQ, and SB.
Figure 6. Validation rates according to suggested cutoff values of parameters are shown in table (A) and graph (B) form using the GATK algorithm after realignment and recalibration.
A+B+C: filtered variants together with suggested values of SNPQ, GQ, and SB.
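
The comparison of common and caller-specific variants described for Figure 3 can be sketched with set operations. This is a simplified illustration that assumes two call sets rather than the four SNP sets in the study, and that identifies each variant by a `(chrom, pos, ref, alt)` tuple. The function names and the toy data are hypothetical.

```python
def partition_calls(samtools_calls, gatk_calls):
    """Split variant keys into common and caller-specific groups."""
    samtools_set = set(samtools_calls)
    gatk_set = set(gatk_calls)
    return {
        "common": samtools_set & gatk_set,
        "samtools_only": samtools_set - gatk_set,
        "gatk_only": gatk_set - samtools_set,
    }


def validation_rate(calls, validated):
    """Percent of calls confirmed by an orthogonal method such as Sanger.

    Returns None for an empty group, because the rate is undefined.
    """
    if not calls:
        return None
    return 100.0 * len(calls & validated) / len(calls)


if __name__ == "__main__":
    # Toy variant keys for illustration only; not data from the study.
    samtools = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")]
    gatk = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 75, "T", "C")]
    confirmed = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}

    for name, group in partition_calls(samtools, gatk).items():
        print(name, len(group), validation_rate(group, confirmed))
```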

