Review
Genomics Proteomics Bioinformatics. 2024 Jul 3;22(2):qzae024.
doi: 10.1093/gpbjnl/qzae024.

Chasing Sequencing Perfection: Marching Toward Higher Accuracy and Lower Costs


Hangxing Jia et al. Genomics Proteomics Bioinformatics.

Abstract

Next-generation sequencing (NGS), represented by Illumina platforms, has been an essential cornerstone of basic and applied research. However, the sequencing error rate of 1 per 1000 bp (10⁻³) represents a serious hurdle for research areas focusing on rare mutations, such as somatic mosaicism or microbe heterogeneity. By examining the high-fidelity sequencing methods developed in the past decade, we summarized three major factors underlying errors and the corresponding 12 strategies mitigating these errors. We then proposed a novel framework to classify 11 preexisting representative methods according to the corresponding combinatory strategies and identified three trends that emerged during methodological developments. We further extended this analysis to eight long-read sequencing methods, emphasizing error reduction strategies. Finally, we suggest two promising future directions that could achieve comparable or even higher accuracy with lower costs in both NGS and long-read sequencing.

Keywords: Consensus sequencing; High-fidelity sequencing; Rare mutation; Sequencing error; Single-molecule sequencing.
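
To make the scale of the problem in the abstract concrete, the following is a minimal arithmetic sketch, not taken from the article. It assumes a single genomic site, a uniform per-base error rate of 10⁻³, and illustrative values for sequencing depth and variant fraction; the function name `expected_counts` is hypothetical.

```python
def expected_counts(depth, error_rate, variant_fraction):
    """Return (expected erroneous base calls, expected true variant reads) at one site.

    Assumes errors are independent and uniform across reads, which is a
    simplification; real error rates vary with position and context.
    """
    return depth * error_rate, depth * variant_fraction


# Illustrative values only: 10,000x depth, error rate 1e-3, variant present in 1 of 10,000 molecules.
errors, variants = expected_counts(depth=10_000, error_rate=1e-3, variant_fraction=1e-4)
print(f"expected erroneous calls: {errors:.1f}, expected true variant reads: {variants:.1f}")
```

Under these assumptions, erroneous calls outnumber true variant reads roughly tenfold at the site, which is why raw read counts alone cannot separate a rare mutation from sequencing noise.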

Conflict of interest statement

The authors have declared no competing interests.

Figures

Figure 1
Causes underlying sequencing errors and strategies reducing errors. A. An overall schema of the typical Illumina sequencing process. The left, middle, and right images show DNA extraction, library construction, and sequencing, respectively. Notably, in the right image, three Illumina sequencing clusters are shown, each of which consists of PCR products from one single DNA molecule. Through sequencing by synthesis of the PCR products within each cluster, Illumina reports a cluster-level consensus as the final data. Errors (indicated by red stars) can happen in each step and can be roughly divided into three types: DNA damage, PCR-associated errors, and sequencing-associated errors. B. Two types of amplification modes. Errors are again indicated by red stars. The DNA template is marked in green, while amplified DNA products are shown in blue. In the middle image, the red dots mark the boundary of each copy. In the right image, RNA is marked in brown, while RNA polymerase is shown as an orange dot. C. Error avoidance during end repair or A-tailing. The two DNA strands are represented in green and blue, while the extended single-strand DNA is shown in brown. Errors are indicated by red stars. The central image illustrates internal (top) and terminal (bottom) errors. The left image showcases standard end repair and A-tailing, and the right image displays DNA blunting and modified A-tailing. In the left image, internal (top) and terminal (bottom) errors propagate to the complementary strand during end repair and A-tailing. In the top panel of the right image, the internal nick is extended according to the complementary strand; if ddBTP (ddGTP, ddCTP, or ddTTP; shown as a brown dot) is added, the extension stops. In the bottom panel of the right image, exonuclease is used to cut the single-strand overhang, followed by the addition of dATP. D. The sequencing error distribution of Illumina paired-end sequencing (2 × 150 bp) in libraries with short and long insert sizes. Position 0 marks the 5′ terminus of reads. This figure is modified from [27]. E. Mutation signature with or without removal of DNA damage. DNA damage induces C-to-T and G-to-T errors (marked in red), which can be removed by UDG and Fpg, respectively. This figure is modified from [23]. F. Strategies for reducing sequencing errors. Two types of strategies could be further divided into 12 subtypes. PCR, polymerase chain reaction; ddGTP, 2′,3′-dideoxyguanosine 5′-triphosphate; ddCTP, 2′,3′-dideoxycytidine 5′-triphosphate; ddTTP, 2′,3′-dideoxythymidine 5′-triphosphate; dATP, deoxyadenosine triphosphate; UDG, uracil-DNA glycosylase; Fpg, formamidopyrimidine DNA glycosylase; C, cytosine; T, thymine; G, guanine; A, adenine; dNTP, deoxyribonucleoside triphosphate; RCA, rolling circle amplification.
Figure 2
Schematic diagram of consensus calling across different methods. A. DupSeq, BotSeqS, META-CS, and NanoSeq. These methods adopt two rounds of duplex consensus calling, for both of which amplification bias and sequencing randomness may lead to failure of consensus sequence generation. B. CypherSeq, SMM-seq, and LIANTI. Only one round of consensus calling was performed. C. CircSeq. D. o2n-seq. E. CODEC. CircSeq, o2n-seq, and CODEC rely on the linked copies to improve data efficiency. As shown in Figure 1B, the red dots show the boundary of each copy. CircSeq generates multiple copies for one DNA molecule, while o2n-seq and CODEC only generate two copies. Notably, CircSeq used 250 bp paired-end sequencing in which one read could be long enough to cover one fragment more than one time. F. PECC-Seq. With amplification-free library preparation and consensus calling by overlapping reads, PECC-Seq reaches intermediate data efficiency. For some methods (e.g., DupSeq), read grouping is achieved through barcodes and/or mapping positions, while for the other methods, grouping is guided by only mapping positions (see also Table 1). In (D) and (F), validation or overlapping between paired reads is also demonstrated. For the consensus calling process, the light blue triangle represents the consensus sequence generated from single-strand template DNA, while the light brown triangle represents the consensus sequence generated from double-strand template DNA. DupSeq, duplex sequencing; BotSeqS, bottleneck sequencing system; META-CS, multiplexed end-tagging amplification of complementary strands; NanoSeq, nanorate sequencing; CircSeq, circle sequencing; SMM-seq, single-molecule mutation sequencing; CODEC, Concatenating Original Duplex for Error Correction; LIANTI, Linear Amplification via Transposon Insertion; PECC-Seq, Paired-End and Complementary Consensus Sequencing.
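
The two rounds of consensus calling described for panel A can be sketched roughly as follows. This is an illustrative simplification rather than the implementation of DupSeq or any other published method: it assumes reads are already grouped by a molecular barcode, labeled by strand of origin, oriented to the same reference strand, and of equal length. The function names, the `min_fraction` and `min_reads` thresholds, and the toy reads are all hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs, min_fraction=0.6):
    """Column-wise majority vote; positions with weak agreement become 'N'."""
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)


def duplex_consensus(reads, min_reads=2):
    """reads: iterable of (barcode, strand, sequence) with strand '+' or '-'.

    Round 1 builds one consensus per strand; round 2 keeps only the bases
    on which both strand consensuses agree.
    """
    families = defaultdict(lambda: defaultdict(list))
    for barcode, strand, seq in reads:
        families[barcode][strand].append(seq)

    result = {}
    for barcode, by_strand in families.items():
        # A family fails if either strand is missing or under-sampled.
        if any(len(by_strand[s]) < min_reads for s in "+-"):
            continue
        plus = consensus(by_strand["+"])
        minus = consensus(by_strand["-"])
        result[barcode] = "".join(
            a if a == b and a != "N" else "N" for a, b in zip(plus, minus)
        )
    return result


# Toy reads (hypothetical data).
reads = [
    ("bc1", "+", "ACGT"), ("bc1", "+", "ACGT"), ("bc1", "+", "ACGA"),  # one read-level error
    ("bc1", "-", "ACGT"), ("bc1", "-", "ACGT"),
    ("bc2", "+", "TTCA"), ("bc2", "+", "TTCA"),  # error shared by every '+' read
    ("bc2", "-", "TTGA"), ("bc2", "-", "TTGA"),
    ("bc3", "+", "GGGG"), ("bc3", "+", "GGGG"),  # no '-' reads recovered
]
result = duplex_consensus(reads)
print(result)
```

In this toy input, the isolated error in `bc1` is outvoted within its strand family, the strand-specific error in `bc2` is masked because the two strands disagree, and `bc3` yields no consensus because one strand was never sampled, which mirrors the consensus failure the caption attributes to amplification bias and sequencing randomness.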
Figure 3
Schematic diagram of high-fidelity long-read sequencing methods. A. PacBio HiFi sequencing. HiFi sequencing applies amplification-free library preparation, RCA consensus sequencing, and double-strand correction strategies to generate the final HiFi read (in red). B. PacBio HiDEF-seq. HiDEF-seq implements two additional strategies: error blockage from end repair and A-tailing, and minimization of DNA damage. It also applies a DupSeq-like single-strand and double-strand consensus calling framework (Figure 2A). In (A) and (B), a single DNA molecule is circularized via two hairpin adaptors where the two strands are marked in green and blue, respectively. These strands would be presented as linked multicopy sequences in one amplicon. The yellow dot marks DNA polymerase. C. ONT 2D sequencing. D. ONT 1D2 sequencing. E. ONT R10 sequencing. F. ONT duplex sequencing. The motor protein is depicted as a filled pink circle, whereas the sequencing readers are represented as pairs of dark red rectangles. Notably, both ONT R10 and duplex sequencing employ two readers (dual pinch points). G. ONT INC-Seq (upper) and R2C2 (lower). Both methods use RCA to amplify a circular DNA to obtain multiple linked copies within one amplicon. With a dedicated DNA splint for ligation, R2C2 has a higher DNA circularization efficiency than INC-Seq. The red dot marks the boundary between different copies. PacBio, Pacific Biosciences; HiFi, high-fidelity; HiDEF-seq, Hairpin Duplex Enhanced Fidelity Sequencing; ONT, Oxford Nanopore Technologies; INC-Seq, Intramolecular-ligated Nanopore Consensus Sequencing; R2C2, Rolling Circle Amplification to Concatemeric Consensus.

References

    1. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008;26:1135–45.
    2. Zavodna M, Bagshaw A, Brauning R, Gemmell NJ. The accuracy, feasibility and challenges of sequencing short tandem repeats using next-generation sequencing platforms. PLoS One 2014;9:e113862.
    3. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998;8:186–94.
    4. Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998;8:175–85.
    5. Salk JJ, Schmitt MW, Loeb LA. Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet 2018;19:269–85.