Chasing Sequencing Perfection: Marching Toward Higher Accuracy and Lower Costs

doi:10.1093/gpbjnl/qzae024

Review

Genomics Proteomics Bioinformatics. 2024 Jul 3;22(2):qzae024.

Hangxing Jia¹, Shengjun Tan¹, Yong E Zhang¹,²,³

Affiliations

¹ CAS Key Laboratory of Zoological Systematics and Evolution & State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
² University of Chinese Academy of Sciences, Beijing 100049, China.
³ CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China.

PMID: 38991976
PMCID: PMC11423848
DOI: 10.1093/gpbjnl/qzae024

Hangxing Jia et al. Genomics Proteomics Bioinformatics. 2024.

Abstract

Next-generation sequencing (NGS), represented by Illumina platforms, has been an essential cornerstone of basic and applied research. However, the sequencing error rate of 1 per 1000 bp (10⁻³) represents a serious hurdle for research areas focusing on rare mutations, such as somatic mosaicism or microbe heterogeneity. By examining the high-fidelity sequencing methods developed in the past decade, we summarized three major factors underlying errors and the corresponding 12 strategies mitigating these errors. We then proposed a novel framework to classify 11 preexisting representative methods according to the corresponding combinatory strategies and identified three trends that emerged during methodological developments. We further extended this analysis to eight long-read sequencing methods, emphasizing error reduction strategies. Finally, we suggest two promising future directions that could achieve comparable or even higher accuracy with lower costs in both NGS and long-read sequencing.

Keywords: Consensus sequencing; High-fidelity sequencing; Rare mutation; Sequencing error; Single-molecule sequencing.
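
To make the 10⁻³ figure in the abstract concrete, the following minimal sketch converts it to a Phred quality score (Q = −10·log₁₀ p) and compares the expected number of true and erroneous reads at a single site. The variant frequency and depth are hypothetical values chosen only for illustration; they do not come from the article.

```python
import math

def phred_quality(error_rate: float) -> float:
    """Phred quality score Q = -10 * log10(p) for a per-base error probability p."""
    return -10.0 * math.log10(error_rate)

def expected_supporting_reads(fraction: float, depth: int) -> float:
    """Expected number of reads at one site carrying a base present at `fraction`."""
    return fraction * depth

RAW_ERROR_RATE = 1e-3      # per-base error rate quoted in the abstract
VARIANT_FRACTION = 1e-4    # hypothetical rare-mutation frequency (assumption)
DEPTH = 100_000            # hypothetical sequencing depth at the site (assumption)

q = phred_quality(RAW_ERROR_RATE)                                # about Q30
true_reads = expected_supporting_reads(VARIANT_FRACTION, DEPTH)  # about 10 reads
error_reads = expected_supporting_reads(RAW_ERROR_RATE, DEPTH)   # about 100 reads
print(f"Q{q:.0f}: {true_reads:.0f} true vs {error_reads:.0f} erroneous reads expected")
```

Under these assumed numbers, reads with any miscalled base at the site outnumber reads carrying the real variant, which is why raw reads alone cannot resolve mutations much rarer than the error rate.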

© The Author(s) 2024. Published by Oxford University Press and Science Press on behalf of the Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.

Conflict of interest statement

The authors have declared no competing interests.

Figures

**Figure 1**
Causes underlying sequencing errors and strategies reducing errors.
A. An overall schema of the typical Illumina sequencing process. The left, middle, and right images show DNA extraction, library construction, and sequencing, respectively. Notably, in the right image, three Illumina sequencing clusters are shown, each of which consists of PCR products from one single DNA molecule. Through sequencing by synthesis of the PCR products within each cluster, Illumina reports a cluster-level consensus as the final data. Errors (indicated by red stars) can happen in each step and can be roughly divided into three types: DNA damage, PCR-associated errors, and sequencing-associated errors. B. Two types of amplification modes. Errors are again indicated by red stars. The DNA template is marked in green, while amplified DNA products are shown in blue. In the middle image, the red dots mark the boundary of each copy. In the right image, RNA is marked in brown, while RNA polymerase is shown as an orange dot. C. Error avoidance during end repair or A-tailing. The two DNA strands are represented in green and blue, while the extended single-strand DNA is shown in brown. Errors are indicated by red stars. The central image illustrates internal (top) and terminal (bottom) errors. The left image showcases standard end repair and A-tailing, and the right image displays DNA blunting and modified A-tailing. In the left image, internal (top) and terminal (bottom) errors propagate to the complementary strand during end repair and A-tailing. In the top panel of the right image, the internal nick is extended according to the complementary strand; if ddBTP (ddGTP, ddCTP, or ddTTP; shown as a brown dot) is added, the extension stops. In the bottom panel of the right image, exonuclease is used to cut the single-strand overhang, followed by the addition of dATP. D. The sequencing error distribution of Illumina paired-end sequencing (2 × 150 bp) in libraries with short and long insert sizes. Position 0 marks the 5′ terminus of reads. This figure is modified from [27]. E. Mutation signature with or without removal of DNA damage. DNA damage induces C-to-T and G-to-T errors (marked in red), which can be removed by UDG and Fpg, respectively. This figure is modified from [23]. F. Strategies for reducing sequencing errors. Two types of strategies could be further divided into 12 subtypes. PCR, polymerase chain reaction; ddGTP, 2′,3′-dideoxyguanosine 5′-triphosphate; ddCTP, 2′,3′-dideoxycytidine 5′-triphosphate; ddTTP, 2′,3′-dideoxythymidine 5′-triphosphate; dATP, deoxyadenosine triphosphate; UDG, uracil-DNA glycosylase; Fpg, formamidopyrimidine DNA glycosylase; C, cytosine; T, thymine; G, guanine; A, adenine; dNTP, deoxyribonucleoside triphosphate; RCA, rolling circle amplification.

**Figure 2**
Schematic diagram of consensus calling across different methods.
A. DupSeq, BotSeqS, META-CS, and NanoSeq. These methods adopt two rounds of duplex consensus calling, for both of which amplification bias and sequencing randomness may lead to failure of consensus sequence generation. B. CypherSeq, SMM-seq, and LIANTI. Only one round of consensus calling was performed. C. CircSeq. D. o2n-seq. E. CODEC. CircSeq, o2n-seq, and CODEC rely on the linked copies to improve data efficiency. As shown in Figure 1B, the red dots show the boundary of each copy. CircSeq generates multiple copies for one DNA molecule, while o2n-seq and CODEC only generate two copies. Notably, CircSeq used 250 bp paired-end sequencing in which one read could be long enough to cover one fragment more than one time. F. PECC-Seq. With amplification-free library preparation and consensus calling by overlapping reads, PECC-Seq reaches a middle data efficiency. For some methods (*e.g.*, DupSeq), read grouping is achieved through barcodes and/or mapping positions, while for the other methods, grouping is guided by only mapping positions (see also Table 1). In (D) and (F), validation or overlapping between paired reads is also demonstrated. For the consensus calling process, the light blue triangle represents the consensus sequence generated from single-strand template DNA, while the light brown triangle represents the consensus sequence generated from double-strand template DNA. DupSeq, duplex sequencing; BotSeqS, bottleneck sequencing system; META-CS, multiplexed end-tagging amplification of complementary strands; NanoSeq, nanorate sequencing; CircSeq, circle sequencing; SMM-seq, single-molecule mutation sequencing; CODEC, Concatenating Original Duplex for Error Correction; LIANTI, Linear Amplification via Transposon Insertion; PECC-Seq, Paired-End and Complementary Consensus Sequencing.
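
The two-round consensus calling described for panel A can be illustrated with a minimal sketch. It is a generic illustration, not the implementation of DupSeq or any other method named above: the function names, the `min_reads` and `min_fraction` thresholds, and the input layout (barcode, strand, sequence) are all assumptions, and reads are assumed to be pre-aligned, of equal length, and reported in the same reference orientation.

```python
from collections import Counter, defaultdict

def single_strand_consensus(reads, min_reads=2, min_fraction=0.6):
    """Round 1: majority vote per position across reads from one strand.

    Returns None when the read family is too small; 'N' marks positions
    where no base reaches `min_fraction` of the reads.
    """
    if len(reads) < min_reads:
        return None
    consensus = []
    for column in zip(*reads):  # assumes equal-length, pre-aligned reads
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)

def duplex_consensus(top, bottom):
    """Round 2: keep a base only where both strand consensuses agree."""
    if top is None or bottom is None:
        return None
    return "".join(a if a == b and a != "N" else "N" for a, b in zip(top, bottom))

def call_duplex(records, **thresholds):
    """Group (barcode, strand, sequence) records by barcode, then call both rounds."""
    families = defaultdict(lambda: {"top": [], "bottom": []})
    for barcode, strand, seq in records:
        families[barcode][strand].append(seq)
    return {
        barcode: duplex_consensus(
            single_strand_consensus(fam["top"], **thresholds),
            single_strand_consensus(fam["bottom"], **thresholds),
        )
        for barcode, fam in families.items()
    }

# Hypothetical toy input: three original molecules tagged bc1..bc3.
records = [
    ("bc1", "top", "ACGT"), ("bc1", "top", "ACGT"), ("bc1", "top", "ACTT"),
    ("bc1", "bottom", "ACGT"), ("bc1", "bottom", "ACGT"),
    ("bc2", "top", "AAGT"), ("bc2", "top", "AAGT"),   # change seen on one strand only
    ("bc2", "bottom", "ACGT"), ("bc2", "bottom", "ACGT"),
    ("bc3", "top", "ACGT"), ("bc3", "top", "ACGT"),   # bottom strand never sequenced
]
result = call_duplex(records)
print(result)
```

In this toy input, the isolated error in one bc1 read is outvoted in round 1, the strand-specific change in bc2 is masked in round 2, and bc3 yields no duplex consensus because one strand is missing, which mirrors the caption's point that uneven amplification or sampling can prevent consensus generation.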

**Figure 3**
Schematic diagram of high-fidelity long-read sequencing methods.
A. PacBio HiFi sequencing. HiFi sequencing applies amplification-free library preparation, RCA consensus sequencing, and double-strand correction strategies to generate the final HiFi read (in red). B. PacBio HiDEF-seq. HiDEF-seq implements two additional strategies: error blockage from end repair and A-tailing, and minimization of DNA damage. It also applies a DupSeq-like single-strand and double-strand consensus calling framework (Figure 2A). In (A) and (B), a single DNA molecule is circularized via two hairpin adaptors where the two strands are marked in green and blue, respectively. These strands would be presented as linked multicopy sequences in one amplicon. The yellow dot marks DNA polymerase. C. ONT 2D sequencing. D. ONT 1D² sequencing. E. ONT R10 sequencing. F. ONT duplex sequencing. The motor protein is depicted as a filled pink circle, whereas the sequencing readers are represented as pairs of dark red rectangles. Notably, both ONT R10 and duplex sequencing employ two readers (dual pinch points). G. ONT INC-Seq (upper) and R2C2 (lower). Both methods use RCA to amplify a circular DNA to obtain multiple linked copies within one amplicon. With a dedicated DNA splint for ligation, R2C2 has a higher DNA circularization efficiency than INC-Seq. The red dot marks the boundary between different copies. PacBio, Pacific Biosciences; HiFi, high-fidelity; HiDEF-seq, Hairpin Duplex Enhanced Fidelity Sequencing; ONT, Oxford Nanopore Technologies; INC-Seq, Intramolecular-ligated Nanopore Consensus Sequencing; R2C2, Rolling Circle Amplification to Concatemeric Consensus.

Cited by

5-Hydroxymethylcytosine modifications in circulating cell-free DNA: frontiers of cancer detection, monitoring, and prognostic evaluation.
Song D, Zhang Z, Zheng J, Zhang W, Cai J. Biomark Res. 2025 Mar 7;13(1):39. doi: 10.1186/s40364-025-00751-9. PMID: 40055844. Free PMC article. Review.
Unlocking the Potential of Metagenomics with the PacBio High-Fidelity Sequencing Technology.
Han Y, He J, Li M, Peng Y, Jiang H, Zhao J, Li Y, Deng F. Microorganisms. 2024 Dec 2;12(12):2482. doi: 10.3390/microorganisms12122482. PMID: 39770685. Free PMC article. Review.
Advancing genome-based precision medicine: a review on machine learning applications for rare genetic disorders.
Abbas SR, Abbas Z, Zahir A, Lee SW. Brief Bioinform. 2025 Jul 2;26(4):bbaf329. doi: 10.1093/bib/bbaf329. PMID: 40668553. Free PMC article. Review.

References

1. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008;26:1135–45.
2. Zavodna M, Bagshaw A, Brauning R, Gemmell NJ. The accuracy, feasibility and challenges of sequencing short tandem repeats using next-generation sequencing platforms. PLoS One 2014;9:e113862.
3. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. error probabilities. Genome Res 1998;8:186–94.
4. Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. accuracy assessment. Genome Res 1998;8:175–85.
5. Salk JJ, Schmitt MW, Loeb LA. Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet 2018;19:269–85.
