Genome Biol. 2020 Mar 17;21(1):71.

doi: 10.1186/s13059-020-01988-3.

Benchmarking of computational error-correction methods for next-generation sequencing data

Keith Mitchell¹, Jaqueline J Brito², Igor Mandric¹,³, Qiaozhen Wu⁴, Sergey Knyazev³, Sei Chang¹, Lana S Martin², Aaron Karlsberg², Ekaterina Gerasimov³, Russell Littman⁵, Brian L Hill¹, Nicholas C Wu⁶, Harry Taegyun Yang¹, Kevin Hsieh¹, Linus Chen¹, Eli Littman¹, Taylor Shabani¹, German Enik¹, Douglas Yao⁷, Ren Sun⁸, Jan Schroeder⁹, Eleazar Eskin¹, Alex Zelikovsky³,¹⁰, Pavel Skums³, Mihai Pop¹¹, Serghei Mangul¹²

Affiliations

¹ Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA.
² Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA.
³ Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA.
⁴ Department of Mathematics, University of California Los Angeles, 520 Portola Plaza, Los Angeles, CA, 90095, USA.
⁵ UCLA Bioinformatics, 621 Charles E Young Dr S, Los Angeles, CA, 90024, USA.
⁶ Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA.
⁷ Department of Molecular, Cell, and Developmental Biology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA.
⁸ Department of Molecular and Medical Pharmacology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA.
⁹ Epigenetics & Reprogramming Laboratory, Monash University, 15 Innovation Walk, Melbourne, VIC, 3800, Australia.
¹⁰ The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia, 119991.
¹¹ Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA.
¹² Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA. serghei.mangul@gmail.com.

PMID: 32183840
PMCID: PMC7079412
DOI: 10.1186/s13059-020-01988-3

Keith Mitchell et al. Genome Biol. 2020.

Abstract

Background: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.

Results: In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods.

Conclusions: In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Study design for benchmarking computational error-correction methods. a Schematic representation of the goal of error correction algorithms. Error correction aims to fix sequencing errors while maintaining the data heterogeneity. b Error-free reads for gold standard were generated using UMI-based clustering. Reads were grouped based on matching UMIs and corrected by consensus, where an 80% majority was required to correct sequencing errors without affecting naturally occurring single nucleotide variations (SNVs). c Framework for evaluating the accuracy of error-correction methods. Multiple sequence alignment between the error-free, uncorrected (original), and corrected reads was performed to classify bases in the corrected read. Bases fall into the category of trimming, true negative (TN), true positive (TP), false negative (FN), and false positive (FP)
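
The consensus step described in panel b can be illustrated with a minimal sketch. It assumes that reads sharing a UMI have already been trimmed to the same length and aligned position by position, and it writes `N` wherever no base reaches the 80% majority. Both the function name `umi_consensus` and the `N` fallback are assumptions made for illustration; the caption does not say how positions without a majority are handled.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_fraction=0.8):
    """Build one consensus sequence per UMI group.

    reads: iterable of (umi, sequence) pairs. Sequences in the same
    UMI group are assumed to have equal length (an assumption of this sketch).
    """
    # Group reads by their UMI tag.
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        bases = []
        # Walk the group column by column.
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            # Accept the most frequent base only if it reaches the majority
            # threshold; otherwise mark the position as ambiguous (assumption).
            bases.append(base if count / len(column) >= min_fraction else "N")
        consensus[umi] = "".join(bases)
    return consensus


if __name__ == "__main__":
    example = [("AAC", "ACGT")] * 9 + [("AAC", "ACGA")]
    print(umi_consensus(example))  # {'AAC': 'ACGT'}
```

Requiring a strong majority rather than a simple plurality is what lets a consensus fix isolated sequencing errors while leaving a variant that is shared by most reads in the group untouched.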

**Fig. 2**
Correcting errors in whole genome sequencing data (D1 dataset). For each tool, the best k-mer size was selected. a–f WGS human data. g–l WGS *E. coli* data. a, g Heatmap depicting the gain across various coverage settings. Each row corresponds to an error correction tool, and each column corresponds to a dataset with a given coverage. b, h Heatmap depicting the precision across various coverage settings. Each row corresponds to an error correction tool, and each column corresponds to a dataset with a given coverage. c, i Heatmap depicting the sensitivity across various coverage settings. Each row corresponds to an error correction tool, and each column corresponds to a dataset with a given coverage. d, j Scatter plot depicting the number of TP corrections (x-axis) and FP corrections (y-axis) for datasets with 32x coverage. e, k Scatter plot depicting the number of FP corrections (x-axis) and FN corrections (y-axis) for datasets with 32x coverage. f, l Scatter plot depicting the sensitivity (x-axis) and precision (y-axis) for datasets with 32x coverage
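
The quantities plotted in these panels can be derived from per-base counts, as in the following sketch. It assumes the error-free, original, and corrected reads are already aligned to equal length, with `-` marking a trimmed base in the corrected read. The formulas are the commonly used definitions (precision = TP/(TP+FP), sensitivity = TP/(TP+FN), gain = (TP−FP)/(TP+FN)); they are not stated in the text above, so treat them as assumptions. Counting an erroneous base that was changed to another wrong base as FN is likewise a choice made for this sketch.

```python
from collections import Counter


def classify_bases(truth, original, corrected, gap="-"):
    """Count trimmed, TN, FP, TP and FN bases for one aligned read triple."""
    counts = Counter()
    for t, o, c in zip(truth, original, corrected):
        if c == gap:
            counts["trimmed"] += 1          # base removed by the tool
        elif o == t:
            # The original base was correct: keeping it is TN, changing it is FP.
            counts["TN" if c == t else "FP"] += 1
        else:
            # The original base was an error: fixing it is TP, anything else is FN.
            counts["TP" if c == t else "FN"] += 1
    return counts


def summarize(counts):
    """Compute gain, precision and sensitivity from the counts (assumed formulas)."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    errors = tp + fn                        # errors present before correction
    return {
        "gain": (tp - fp) / errors if errors else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / errors if errors else 0.0,
    }


if __name__ == "__main__":
    counts = classify_bases("ACGTACGTA", "ACCTACGAC", "ACGTTCG-C")
    print(dict(counts))
    print(summarize(counts))
```

Under this definition, gain becomes negative when a tool introduces more errors than it fixes, which precision and sensitivity alone do not show.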

**Fig. 3**
Correcting errors in TCR-Seq data (D2 dataset). For all plots, the mean value across 8 TCR-Seq samples is reported for each tool. a Bar plot depicting the gain across various error-correction methods. b Scatter plot depicting the number of TP corrections (x-axis) and FP corrections (y-axis). c Scatter plot depicting the number of FP corrections (x-axis) and FN corrections (y-axis). d Scatter plot depicting the sensitivity (x-axis) and precision (y-axis) of each tool

**Fig. 4**
Correcting errors in viral sequencing data (D4 dataset). For all plots, the best k-mer size was selected. a Bar plot depicting the gain across various error-correction methods. b Scatter plot depicting the sensitivity (x-axis) and precision (y-axis) of each tool

Associated data

figshare/10.6084/m9.figshare.11776413
