Genome Biol. 2019 Mar 14;20(1):50.

doi: 10.1186/s13059-019-1659-6.

Analysis of error profiles in deep next-generation sequencing data

Xiaotu Ma¹, Ying Shao², Liqing Tian², Diane A Flasch², Heather L Mulder², Michael N Edmonson², Yu Liu², Xiang Chen², Scott Newman², Joy Nakitandwe³, Yongjin Li², Benshang Li⁴, Shuhong Shen⁴, Zhaoming Wang²,⁵, Sheila Shurtleff³, Leslie L Robison⁵, Shawn Levy⁶, John Easton², Jinghui Zhang⁷

Affiliations

¹ Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA. Xiaotu.Ma@stjude.org.
² Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA.
³ Department of Pathology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA.
⁴ Key Laboratory of Pediatric Hematology and Oncology Ministry of Health, Department of Hematology and Oncology, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, 200127, China.
⁵ Department of Epidemiology and Cancer Control, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA.
⁶ HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA.
⁷ Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA. Jinghui.Zhang@stjude.org.

PMID: 30867008
PMCID: PMC6417284
DOI: 10.1186/s13059-019-1659-6

Xiaotu Ma et al. Genome Biol. 2019.

Abstract

Background: Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introduced at various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing. In this study, we use current NGS technology to systematically investigate these questions.

Results: By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10⁻⁵ to 10⁻⁴, which is 10- to 100-fold lower than the rate generally considered achievable (10⁻³) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10⁻⁵ for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10⁻⁴ for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR leads to an approximately 6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at frequencies of 0.1% to 0.01% with current NGS technology by applying in silico error suppression.
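As an illustration of how error rates can be reported per substitution type, the following minimal sketch pools each of the 12 substitutions with its reverse complement (for example C>T with G>A) and divides the pooled mismatch count by the pooled depth. It is not the authors' implementation; the function names, the input layout, and the toy counts are assumptions made for this example.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Return a strand-symmetric label such as 'C>T/G>A'."""
    forward = f"{ref}>{alt}"
    reverse = f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"
    return "/".join(sorted((forward, reverse)))


def class_error_rates(mismatch_counts, ref_depth):
    """Pool the 12 substitutions into 6 complementary classes.

    mismatch_counts: {(ref, alt): number of mismatching bases} (hypothetical layout)
    ref_depth: {ref: total bases sequenced over sites with that reference base}
    """
    errors = defaultdict(int)
    depth = defaultdict(int)
    for ref in "ACGT":
        for alt in "ACGT":
            if alt == ref:
                continue
            label = substitution_class(ref, alt)
            errors[label] += mismatch_counts.get((ref, alt), 0)
            depth[label] += ref_depth.get(ref, 0)
    return {label: errors[label] / depth[label] for label in errors if depth[label] > 0}


# Toy counts, invented for illustration only.
toy_mismatches = {("C", "T"): 30, ("G", "A"): 10, ("A", "G"): 100, ("T", "C"): 100}
toy_depth = {base: 1_000_000 for base in "ACGT"}
rates = class_error_rates(toy_mismatches, toy_depth)
```

With these toy inputs the C>T/G>A class has 40 mismatches over 2,000,000 bases, which is a rate of 2 × 10⁻⁵.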

Conclusions: We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.

Keywords: Deep sequencing; Detection; Error rate; Hotspot mutation; Subclonal; Substitution.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

A pending patent application has been filed based on the research disclosed in this manuscript; the patent does not restrict the research use of the findings in this article. The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Potential error sources in next-generation sequencing workflow. a Illustration of the major steps of a typical next-generation sequencing workflow. Targeted deep sequencing is usually done by amplicon protocol or hybridization-capture protocol. Potential error sources are indicated by numbers. b Percentage of high-quality (Q30) bases by position in NGS read. This shows that the first and the last 5 bp have lower percentages of high-quality bases than do other positions. c Cumulative plot of NGS read quality distribution categorized by low-quality mapping (MAPQ < 55), potentially problematic alignment (“Methods”), and number of poor-quality bases in read (from ≥ 16 bp to 0 bp per read)
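The read-level criteria named in this caption can be sketched as a simple filter. This is a hypothetical illustration, not CleanDeepSeq itself: the `Read` class, the function name, and the `max_poor_bases` default are assumptions, while the MAPQ cutoff of 55, the Q30 threshold, and the 5 bp read ends are taken from the caption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    mapq: int              # mapping quality of the alignment
    base_quals: List[int]  # Phred quality of each base in the read


def usable_positions(read, min_mapq=55, min_base_q=30, max_poor_bases=0, trim=5):
    """Return the read offsets kept for counting, or [] if the read is rejected.

    max_poor_bases is a hypothetical tuning parameter; the caption only says
    that reads are binned by their number of poor-quality bases.
    """
    if read.mapq < min_mapq:
        return []
    poor = sum(q < min_base_q for q in read.base_quals)
    if poor > max_poor_bases:
        return []
    # Skip the first and last `trim` bases, where quality tends to be lower.
    return [
        i
        for i in range(trim, len(read.base_quals) - trim)
        if read.base_quals[i] >= min_base_q
    ]


good = usable_positions(Read(mapq=60, base_quals=[35] * 20))
low_mapq = usable_positions(Read(mapq=40, base_quals=[35] * 20))
```

Here `good` keeps offsets 5 through 14 of a 20 bp read, and `low_mapq` is empty because the read fails the mapping-quality cutoff.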

**Fig. 2**
Comparison of sequencing errors with known somatic mutations in deep sequencing data generated from diluted COLO829 cancer cell line. a Error rate (y-axis) in *BRAF* V600 amplicon (x-axis: chr7 positions) under standard pileup (top) and CleanDeepSeq (bottom). A>T errors are shown in red and other errors shown in gray. Known somatic mutation *BRAF* V600E is shown in purple. Also shown are error rates summarized at sample level by pileup (left panels, “Methods”) or CleanDeepSeq (right panels) for 1:1000 dilution (b) and 1:5000 dilution (c). The 12 possible substitution patterns (first parenthesis) are depicted in rows. Median error rates (log10 scale) are indicated on the left, and sample sizes (number of genomic sites) for the histogram are indicated on the right in the second parenthesis. The x-axis displays the error rate in log10 scale. The designed MAF ladders for the known somatic mutations were depicted using red, blue, and black lines labeled on top, and the known somatic mutations were colored according to their expected MAF. Black arrow: *BRAF* V600E, which has 4 mutant alleles and 2 wildtype alleles in COLO829, so that at 1:1000 dilution and 1:5000 dilution the expected MAF are 0.002 and 0.0004, respectively (“Methods”)
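The expected MAF values quoted for *BRAF* V600E can be reproduced with a short calculation. The sketch below assumes that a 1:N dilution means a fraction 1/N of the DNA comes from COLO829 and that the diluent carries two wildtype copies of the locus; both are assumptions made for this example, and the function name is hypothetical.

```python
def expected_maf(dilution, mutant_copies, wildtype_copies, normal_copies=2):
    """Expected mutant allele fraction after diluting tumor DNA 1:dilution.

    Assumes the diluent contributes `normal_copies` wildtype alleles per cell
    (an assumption of this sketch, not a value stated in the caption).
    """
    tumor_fraction = 1.0 / dilution
    mutant = tumor_fraction * mutant_copies
    total = (
        tumor_fraction * (mutant_copies + wildtype_copies)
        + (1.0 - tumor_fraction) * normal_copies
    )
    return mutant / total


# BRAF V600E in COLO829: 4 mutant alleles and 2 wildtype alleles.
print(f"{expected_maf(1000, 4, 2):.4f}")  # 0.0020
print(f"{expected_maf(5000, 4, 2):.4f}")  # 0.0004
```

Because the tumor fraction is small, the result is close to the simpler approximation of mutant copies divided by (2 × dilution).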

**Fig. 3**
Context dependency of C>T/G>A errors in deep sequencing data generated from diluted COLO829 cancer cell line. C>T (left panels) and G>A (right panels) errors are decomposed into 16 contexts by including one 5′ base and one 3′ base for 1:1000 dilution (a) and 1:5000 dilution (b), respectively. Contexts showing elevated error rate are marked with an asterisk “*”. See Fig. 2 for legends
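The 16-context decomposition described here can be sketched as follows: each reference C is labeled by its 5′ and 3′ neighbors, and mismatches and depth are tallied per trinucleotide. The input layout (a reference string plus per-position base counts) and the function names are assumptions made for this example, not the authors' code.

```python
from itertools import product


def contexts_for(ref_base):
    """All 16 trinucleotide contexts with `ref_base` in the middle."""
    return [f"{five}{ref_base}{three}" for five, three in product("ACGT", repeat=2)]


def context_error_rates(reference, observations, ref_base="C", alt_base="T"):
    """Error rate of ref_base>alt_base per trinucleotide context.

    reference: reference sequence as a string
    observations: {0-based position: {base: read count}} (hypothetical layout)
    """
    errors = dict.fromkeys(contexts_for(ref_base), 0)
    depth = dict.fromkeys(contexts_for(ref_base), 0)
    for pos in range(1, len(reference) - 1):
        if reference[pos] != ref_base:
            continue
        context = reference[pos - 1 : pos + 2]
        if context not in errors:  # skips contexts containing N or other symbols
            continue
        counts = observations.get(pos, {})
        errors[context] += counts.get(alt_base, 0)
        depth[context] += sum(counts.values())
    return {c: errors[c] / depth[c] for c in errors if depth[c] > 0}


# Toy data, invented for illustration only.
toy_reference = "ACGTCA"
toy_observations = {1: {"C": 9990, "T": 10}, 4: {"C": 5000}}
context_rates = context_error_rates(toy_reference, toy_observations)
```

In the toy data the C at position 1 sits in an ACG context with 10 C>T mismatches in 10,000 bases, so its rate is 10⁻³, while the TCA context has a rate of 0. Calling the function with `ref_base="G"` and `alt_base="A"` gives the corresponding G>A decomposition.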

**Fig. 4**
Error profile in NovaSeq + Q5 dataset generated by StJude (a, b, c) and HAIB (d, e, f). a, d Error rate (y-axis) in *BRAF* V600E amplicon (x-axis: chr7 positions) under direct pileup (top) and CleanDeepSeq (bottom). Also shown are error rates of the 12 change types across two dilutions: b, e 1:1000 dilution; c, f 1:5000 dilution, see Fig. 2 for legends

**Fig. 5**
Sample-specific errors in high-depth capture sequencing data. Each column represents a leukemia sample (47 samples in total), while each row represents a genomic position that was sequenced in all samples. The genomic positions are assigned to heatmap panels a–d by the reference nucleotide at the corresponding position, i.e., C in (a), G in (b), A in (c), and T in (d). In each panel, MAFs for all three possible substitution types are shown in three groups indicated at the top of the panel, sorted by their neighboring DNA context (i.e., 3′ (−) or 5′ (+) flanking bases). Vertical patterns show sample-level DNA damage, which is apparent in C>A and G>T mutations. e Significant correlation of sample-specific error (using the C>A error rate as a surrogate) with error types C>T/G>A and C>G/G>C but not with other types (data not shown). The linear regression and r-squared values are indicated

**Fig. 6**
Genome-wide average error rate in neuroblastoma datasets (panels a, b) and an AML dataset (panel c). Shown are histograms of the genome-wide average error rate (“Methods”) by standard pileup (left panels) and CleanDeepSeq (right panels). In the neuroblastoma dataset (generated by Broad Institute, “Methods”), the Exome_Native subset (b) is known to have sample-level damage, while the Exome_WGA subset (a) does not. Also included is an AML dataset (generated by Baylor College of Medicine) (c). Red vertical lines and numbers indicate medians

**Fig. 7**
Error rate comparison between hybridization-capture and aggregated WGS datasets. Summary statistics (“Methods”) are calculated with 99th percentile (P = 3 × 10⁻⁴; a) and 99.9th percentile (P = 2 × 10⁻⁵; b). We also tried 90th percentile but the linear fitting is poor (r² = 0.47; slope = 4.4; data not shown) due to the fact that many loci have MAF of 0 as described in the “Methods” section
