Genome Biol. 2019 Mar 14;20(1):50.
doi: 10.1186/s13059-019-1659-6.

Analysis of error profiles in deep next-generation sequencing data

Xiaotu Ma et al. Genome Biol.

Abstract

Background: Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introduced at various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing. In this study, we use current NGS technology to systematically investigate these questions.

Results: By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10^-5 to 10^-4, which is 10- to 100-fold lower than the rate generally considered achievable (10^-3) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10^-5 for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10^-4 for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR leads to an approximately 6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at frequencies of 0.1% to 0.01% with current NGS technology by applying in silico error suppression.
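The substitution types above are reported as strand-symmetric pairs (for example, C>A together with G>T). As a minimal sketch of how such pooled rates could be tallied, the Python below collapses the 12 possible substitutions into 6 pairs and divides pooled mismatch counts by pooled depth. The function names, input layout, and denominator choice are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def pair_label(ref, alt):
    """Return the strand-symmetric label, e.g. ('G', 'T') -> 'C>A/G>T'."""
    forward = f"{ref}>{alt}"
    reverse = f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"
    return "/".join(sorted([forward, reverse]))

def pooled_error_rates(mismatch_counts, ref_depth):
    """Pool mismatch counts over complementary substitutions.

    mismatch_counts: {(ref, alt): number of mismatching bases}  (assumed layout)
    ref_depth: {ref: total bases sequenced over sites with that reference base}
    """
    numerators = Counter()
    denominators = {}
    for (ref, alt), count in mismatch_counts.items():
        label = pair_label(ref, alt)
        numerators[label] += count
        # Assumption: the denominator is the depth over both complementary reference bases.
        denominators[label] = ref_depth[ref] + ref_depth[COMPLEMENT[ref]]
    return {label: numerators[label] / denominators[label] for label in numerators}

# Toy counts, invented for illustration only.
counts = {("C", "A"): 3, ("G", "T"): 1, ("A", "G"): 20}
depth = {"A": 100_000, "C": 100_000, "G": 100_000, "T": 100_000}
rates = pooled_error_rates(counts, depth)
```

With these invented counts, the pooled C>A/G>T rate is 4 mismatches over 200,000 bases and the A>G/T>C rate is 20 over 200,000.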

Conclusions: We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.

Keywords: Deep sequencing; Detection; Error rate; Hotspot mutation; Subclonal; Substitution.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

A pending patent application has been filed based on the research disclosed in this manuscript; the patent does not restrict the research use of the findings in this article. The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Potential error sources in next-generation sequencing workflow. a Illustration of the major steps of a typical next-generation sequencing workflow. Targeted deep sequencing is usually done by amplicon protocol or hybridization-capture protocol. Potential error sources are indicated by numbers. b Percentage of high-quality (Q30) bases by position in NGS read. This shows that the first and the last 5 bp have lower percentages of high-quality bases than do other positions. c Cumulative plot of NGS read quality distribution categorized by low-quality mapping (MAPQ < 55), potentially problematic alignment (“Methods”), and number of poor-quality bases in read (from ≥ 16 bp to 0 bp per read)
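The quality criteria named in this caption (mapping quality, Q30 bases, lower quality in the first and last 5 bp, and the number of poor-quality bases per read) suggest a read-level filter. The Python below is a rough sketch of such a filter, not the CleanDeepSeq algorithm, whose details are in the paper's Methods and not in this excerpt. The `Read` class, the function name, and the `max_low_qual` default are hypothetical; the MAPQ 55, Q30, and 5 bp values are taken from the caption.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Read:
    mapq: int               # mapping quality of the alignment
    base_quals: List[int]   # Phred quality score for each base in the read

def usable_positions(read, min_mapq=55, min_baseq=30, trim=5, max_low_qual=0):
    """Return read offsets that pass the illustrative quality filter."""
    # Discard reads with low mapping quality.
    if read.mapq < min_mapq:
        return []
    # Discard reads carrying too many poor-quality bases (threshold is an assumption).
    low_quality = sum(q < min_baseq for q in read.base_quals)
    if low_quality > max_low_qual:
        return []
    # Ignore the first and last `trim` bases, then keep only high-quality bases.
    n = len(read.base_quals)
    return [i for i in range(trim, n - trim) if read.base_quals[i] >= min_baseq]

good = Read(mapq=60, base_quals=[35] * 20)
poorly_mapped = Read(mapq=40, base_quals=[35] * 20)
one_bad_base = Read(mapq=60, base_quals=[35] * 10 + [10] + [35] * 9)
```

Here `usable_positions(good)` keeps offsets 5 through 14, while `poorly_mapped` is dropped entirely. The read with a single poor base is dropped under the default `max_low_qual=0` and keeps its remaining high-quality interior bases when `max_low_qual=1`.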
Fig. 2
Comparison of sequencing errors with known somatic mutations in deep sequencing data generated from diluted COLO829 cancer cell line. a Error rate (y-axis) in BRAF V600 amplicon (x-axis: chr7 positions) under standard pileup (top) and CleanDeepSeq (bottom). A>T errors are shown in red and other errors are shown in gray. Known somatic mutation BRAF V600E is shown in purple. Also shown are error rates summarized at sample level by pileup (left panels, “Methods”) or CleanDeepSeq (right panels) for 1:1000 dilution (b) and 1:5000 dilution (c). The 12 possible substitution patterns (first parenthesis) are depicted in rows. Median error rates (log10 scale) are indicated on the left, and sample sizes (number of genomic sites) for the histogram are indicated on the right in the second parenthesis. The x-axis displays the error rate in log10 scale. The designed MAF ladders for the known somatic mutations are depicted using red, blue, and black lines labeled on top, and the known somatic mutations are colored according to their expected MAF. Black arrow: BRAF V600E, which has 4 mutant alleles and 2 wildtype alleles in COLO829, so that at 1:1000 dilution and 1:5000 dilution the expected MAFs are 0.002 and 0.0004, respectively (“Methods”)
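The expected MAF values quoted for BRAF V600E can be reproduced with a short calculation. The sketch below assumes that a 1:N dilution means one COLO829 cell mixed with N normal cells, and that each normal cell contributes two wildtype copies of the locus. Both are assumptions made for illustration; the paper's Methods define the actual calculation.

```python
def expected_maf(mutant_copies, tumor_total_copies, normal_cells_per_tumor_cell,
                 normal_ploidy=2):
    """Expected mutant allele fraction for one tumor cell diluted into normal cells."""
    # All copies of the locus in the mixture: tumor copies plus normal copies.
    total_copies = tumor_total_copies + normal_cells_per_tumor_cell * normal_ploidy
    return mutant_copies / total_copies

# BRAF V600E in COLO829: 4 mutant + 2 wildtype copies = 6 copies per tumor cell.
maf_1000 = expected_maf(4, 6, 1000)   # 4 / 2006, about 0.002
maf_5000 = expected_maf(4, 6, 5000)   # 4 / 10006, about 0.0004
```

Under these assumptions the results round to the 0.002 and 0.0004 given in the caption.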
Fig. 3
Context dependency of C>T/G>A errors in deep sequencing data generated from diluted COLO829 cancer cell line. C>T (left panels) and G>A (right panels) errors are decomposed into 16 contexts by including one 5′ base and one 3′ base for 1:1000 dilution (a) and 1:5000 dilution (b), respectively. Contexts showing elevated error rate are marked with an asterisk “*”. See Fig. 2 for legends
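Decomposing one substitution type into 16 contexts amounts to grouping sites by their 5′ and 3′ neighboring bases. The Python below is a minimal sketch of that grouping for C>T errors. The function names and the per-position `depth` and `alt_count` inputs are hypothetical, and the toy numbers are invented.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"

def contexts(center):
    """List the 16 trinucleotide contexts around a given center base."""
    return [f"{five}{center}{three}" for five, three in product(BASES, repeat=2)]

def context_error_rates(ref_seq, depth, alt_count, center="C"):
    """Pool error rates by trinucleotide context.

    depth[i]: reads covering position i; alt_count[i]: reads showing the
    substituted base (e.g. T for C>T) at position i. Both layouts are assumptions.
    """
    numerators = defaultdict(int)
    denominators = defaultdict(int)
    # Skip the first and last positions because they lack a flanking base.
    for i in range(1, len(ref_seq) - 1):
        if ref_seq[i] != center:
            continue
        context = ref_seq[i - 1:i + 2]
        numerators[context] += alt_count[i]
        denominators[context] += depth[i]
    return {c: numerators[c] / denominators[c] for c in denominators if denominators[c] > 0}

ref_seq = "ACGTCGA"
depth = [1000] * 7
alt_count = [0, 2, 0, 0, 4, 0, 0]
rates = context_error_rates(ref_seq, depth, alt_count)
```

In this toy input the two C sites fall in the ACG and TCG contexts, so `rates` holds one pooled rate for each. G>A errors can be handled the same way by passing `center="G"`.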
Fig. 4
Error profile in NovaSeq + Q5 dataset generated by StJude (a, b, c) and HAIB (d, e, f). a, d Error rate (y-axis) in BRAF V600E amplicon (x-axis: chr7 positions) under direct pileup (top) and CleanDeepSeq (bottom). Also shown are error rates of the 12 change types across two dilutions: b, e 1:1000 dilution; c, f 1:5000 dilution. See Fig. 2 for legends
Fig. 5
Sample-specific errors in high-depth capture sequencing data. Each column represents a leukemia sample (47 samples in total) while each row represents a genomic position that was sequenced in all samples. The genomic positions are assigned to panels a–d, shown as heatmaps, by the nucleotide at the corresponding position, i.e., C at (a), G at (b), A at (c), and T at (d). In each panel, MAFs for all three possible substitution types are shown in three groups indicated at the top of each panel, sorted by their neighboring DNA context (i.e., 3′ (−) or 5′ (+) flanking bases). Vertical patterns show the sample-level DNA damage, which is apparent in C>A and G>T mutations. e Significant correlation of sample-specific error (surrogated by C>A error rate) with error types C>T/G>A and C>G/G>C but not with other types (data not shown). The linear regression and r-squared values are indicated
Fig. 6
Genome-wide average error rate in neuroblastoma datasets (panels a, b) and an AML dataset (panel c). Shown are histograms of genome-wide average error rate (“Methods”) by standard pileup (left panels) and CleanDeepSeq (right panels). In the neuroblastoma dataset (generated by Broad Institute, “Methods”), the Exome_Native subset (b) is known to have sample-level damage while the Exome_WGA subset (a) does not. Also included is an AML dataset (generated by Baylor College of Medicine) (c). Red vertical lines and numbers indicate the median
Fig. 7
Error rate comparison between hybridization-capture and aggregated WGS datasets. Summary statistics (“Methods”) are calculated with the 99th percentile (P = 3 × 10^-4; a) and the 99.9th percentile (P = 2 × 10^-5; b). We also tried the 90th percentile, but the linear fit is poor (r^2 = 0.47; slope = 4.4; data not shown) because many loci have a MAF of 0, as described in the “Methods” section
