Genome Med. 2014 Oct 28;6(10):89.

doi: 10.1186/s13073-014-0089-z. eCollection 2014.

Reducing INDEL calling errors in whole genome and exome sequencing data

Affiliations

¹ Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY USA ; Stony Brook University, 100 Nicolls Rd, Stony Brook, NY USA ; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY USA.
² Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY USA ; Stony Brook University, 100 Nicolls Rd, Stony Brook, NY USA.
³ Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY USA ; New York Genome Center, New York, NY USA.
⁴ Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY USA ; Centro de Ciencias Genomicas, Universidad Nacional Autonoma de Mexico, Cuernavaca, Morelos Mexico.
⁵ Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY USA.

PMID: 25426171
PMCID: PMC4240813
DOI: 10.1186/s13073-014-0089-z

Han Fang et al. Genome Med. 2014.

Abstract

Background: INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, INDEL variant calling remains error-prone, with errors driven by library preparation, sequencing biases, and algorithm artifacts.

Methods: We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on coverage and composition to rank high- and low-quality INDEL calls. We performed a large-scale validation experiment on 600 loci and found high-quality INDELs to have a substantially lower error rate than low-quality INDELs (7% vs. 51%).
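As a rough illustration of how a coverage-and-composition ranking could work, the sketch below sorts calls into the high, moderate, and low tiers named in the figure captions. The abstract does not give the actual cutoffs or the composition metric, so `MIN_ALT_COVERAGE`, `MAX_COMPOSITION_SCORE`, and the `composition_score` field are hypothetical placeholders, not the authors' values.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical cutoffs for illustration only; the abstract does not state them.
MIN_ALT_COVERAGE = 10         # reads supporting the alternate allele
MAX_COMPOSITION_SCORE = 10.0  # lower means read composition is closer to expectation


@dataclass(frozen=True)
class IndelCall:
    chrom: str
    pos: int
    alt_coverage: int         # assumed field: alternate-allele read count
    composition_score: float  # assumed field: deviation of allele composition


def classify(call: IndelCall) -> str:
    """Return 'high', 'moderate', or 'low' for a single call."""
    good_coverage = call.alt_coverage >= MIN_ALT_COVERAGE
    good_composition = call.composition_score <= MAX_COMPOSITION_SCORE
    if good_coverage and good_composition:
        return "high"
    if not good_coverage and not good_composition:
        return "low"
    return "moderate"


def error_rate(validated: Sequence[bool]) -> float:
    """Fraction of calls that failed experimental validation."""
    if not validated:
        raise ValueError("no validation results supplied")
    failures = sum(1 for passed in validated if not passed)
    return failures / len(validated)


if __name__ == "__main__":
    print(classify(IndelCall("chr1", 100, alt_coverage=25, composition_score=2.0)))
    print(error_rate([True, True, False, True]))
```

Comparing `error_rate` across the tiers is the kind of check the validation experiment performs: a useful scheme should show a much lower failure fraction in the high tier than in the low tier.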

Results: Simulation and experimental data show that assembly-based callers are significantly more sensitive and robust for detecting large INDELs (>5 bp) than alignment-based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (53%), and WGS data uniquely identifies 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (84% vs. 57%), and WES misses many large INDELs. In addition, the concordance for INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identifies 6.3-fold more low-quality INDELs. Furthermore, accurate detection of heterozygous INDELs with Scalpel requires 1.2-fold higher coverage than detection of homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data.
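The concordance figures can be pictured as set overlap between two call sets. The sketch below is a minimal version under two assumptions that the abstract does not spell out: concordance is taken as shared calls divided by the union of calls, and "position-match" is taken to mean the same chromosome and coordinate regardless of alleles, while "exact-match" also requires identical alleles. The `Indel` type and the toy records are made up for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, NamedTuple


class Indel(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def exact_key(v: Indel) -> Hashable:
    """Exact-match: same position and same alleles."""
    return (v.chrom, v.pos, v.ref, v.alt)


def position_key(v: Indel) -> Hashable:
    """Position-match: same position, alleles ignored."""
    return (v.chrom, v.pos)


def concordance(calls_a: Iterable[Indel], calls_b: Iterable[Indel],
                key: Callable[[Indel], Hashable]) -> Dict[str, float]:
    """Venn-style counts plus shared / union (assumed definition)."""
    a = {key(v) for v in calls_a}
    b = {key(v) for v in calls_b}
    shared, union = a & b, a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "concordance": len(shared) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Toy call sets, not data from the study.
    wgs = [Indel("chr1", 100, "AT", "A"), Indel("chr1", 200, "G", "GCC"),
           Indel("chr2", 50, "CAAA", "C")]
    wes = [Indel("chr1", 100, "AT", "A"), Indel("chr1", 200, "G", "GC")]
    print(concordance(wgs, wes, exact_key))
    print(concordance(wgs, wes, position_key))
```

In the toy data the call at `chr1:200` differs only in its inserted sequence, so it counts as shared under position-match but not under exact-match, which is why position-match concordance can only be equal to or higher than exact-match concordance.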

Conclusions: Overall, we show that the accuracy of INDEL detection with WGS is much greater than with WES, even in the targeted region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing practice, the deeper coverage may save total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (for example, capture deficiency, PCR amplification, homopolymers) with various data that will serve as a guideline to effectively reduce INDEL errors in genome sequencing.
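A coverage requirement of this kind can be read off a down-sampling curve: call INDELs at several reduced mean coverages, measure what fraction of a reference call set is recovered at each, and take the lowest coverage that reaches the target. The sketch below assumes that general procedure; the function names, the tuple representation of a variant, and the toy call sets are illustrative and are not taken from the study.

```python
from typing import Dict, Optional, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt), assumed representation


def sensitivity(reference: Set[Variant], recovered: Set[Variant]) -> float:
    """Fraction of reference INDELs also found in a down-sampled call set."""
    if not reference:
        raise ValueError("reference call set is empty")
    return len(reference & recovered) / len(reference)


def min_coverage_for(target: float, by_coverage: Dict[int, float]) -> Optional[int]:
    """Lowest tested mean coverage whose sensitivity reaches the target, else None."""
    for coverage in sorted(by_coverage):
        if by_coverage[coverage] >= target:
            return coverage
    return None


# Toy data only: four reference INDELs and the calls recovered at three depths.
reference = {
    ("chr1", 100, "AT", "A"),
    ("chr1", 200, "G", "GCC"),
    ("chr2", 50, "CAAA", "C"),
    ("chr3", 75, "T", "TA"),
}
calls_at = {
    10: {("chr1", 100, "AT", "A"), ("chr2", 50, "CAAA", "C")},
    20: {("chr1", 100, "AT", "A"), ("chr2", 50, "CAAA", "C"), ("chr3", 75, "T", "TA")},
    30: set(reference),
}
curve = {cov: sensitivity(reference, calls) for cov, calls in calls_at.items()}

if __name__ == "__main__":
    print(curve)
    print(min_coverage_for(0.95, curve))
```

Running the same calculation separately on heterozygous and homozygous calls would give two curves, which is how a difference in required coverage between the two genotype classes could be quantified.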

Figures

**Figure 1**
**Performance comparison between Scalpel and GATK-UnifiedGenotyper in terms of sensitivity (A) and false discovery rate (B) at different coverages, based on simulation data.** Each dot represents one down-sampled experiment. Round dots represent performance on general INDELs (that is, INDELs of size starting at 1 bp) and triangles represent performance on large INDELs (that is, INDELs of size greater than 5 bp). Scalpel results are shown in blue and GATK-UnifiedGenotyper results in green.

**Figure 2**
**Mean concordance of INDELs over eight samples between WGS (blue) and WES (green) data.** Venn diagram showing the numbers and percentages of INDELs shared between data types based on **(A)** exact-match and **(B)** position-match. The mean concordance rate increased when we required at least a certain number of reads in both data sets (Table 1).

**Figure 3**
**Coverage distributions and fractions of the exonic targeted regions.** The coverage distributions of the exonic targeted regions in **(A)** the WGS data, **(B)** the WES data. The Y-axis for (A) and (B) is on a log10 scale. The coverage fractions of the exonic targeted regions from 1X to 51X in **(C)** the WGS data, **(D)** the WES data.

**Figure 4**
**Coverage distributions and fractions of the WGS-specific INDEL regions.** The coverage distributions of the WGS-specific INDEL regions in **(A)** the WGS data, **(B)** the WES data. The Y-axis for (A) and (B) is on a log10 scale. The coverage fractions of the WGS-specific INDEL regions from 1X to 51X in **(C)** the WGS data, **(D)** the WES data.

**Figure 5**
**Percentage of high-quality, moderate-quality, and low-quality INDELs in three call sets.** From left to right: the WGS-WES intersection INDELs, the WGS-specific INDELs, and the WES-specific INDELs. The number on top of each call set represents the mean number of INDELs in that call set over eight samples.

**Figure 6**
**Percentage of poly-A, poly-C, poly-G, poly-T, other-STR, and non-STR in three call sets. (A)** High-quality INDELs, **(B)** low-quality INDELs. In both panels, from left to right are WGS-WES intersection INDELs, WGS-specific INDELs, and WES-specific INDELs.

**Figure 7**
**Numbers of genomic locations containing multiple signature INDELs in WGS (blue) and WES data (green).** The height of the bar represents the mean across eight samples and the error bar represents the standard deviation across eight samples.

**Figure 8**
**Percentage of reads near regions of Non-homopolymer, poly-N, poly-A, poly-C, poly-G, poly-T in (A) WGS data, (B) WES data.** In both figures, from left to right are exonic targeted regions, WGS-WES intersection INDELs, WGS-specific INDELs, and WES-specific INDELs.

**Figure 9**
**Concordance of INDEL detection between PCR-free and standard WGS data on NA12878.** Venn diagram showing the numbers and percentages of INDELs shared between data types based on **(A)** exact-match and **(B)** position-match.

**Figure 10**
**Percentage of high-quality, moderate-quality, and low-quality INDELs in two data sets.** From left to right: the PCR-free and standard WGS INDELs, the PCR-free-specific INDELs, and the standard-WGS-specific INDELs. The number on top of each call set represents the number of INDELs in that call set.

**Figure 11**
**Percentage of poly-A, poly-C, poly-G, poly-T, other-STR, and non-STR in (A) high-quality INDELs and (B) low-quality INDELs.** In both panels, from left to right are PCR-free and standard WGS INDELs, INDELs specific to PCR-free data, and INDELs specific to standard WGS data.

**Figure 12**
**Sensitivity performance of INDEL detection with eight WGS data sets at different mean coverages on the Illumina HiSeq2000 platform.** The Y-axis represents the percentage of the WGS-WES intersection INDELs revealed at a given lower mean coverage. **(A)** Sensitivity performance of INDEL detection for each sample. **(B)** Sensitivity performance of heterozygous (blue) and homozygous (green) INDEL detection, shown separately.
