Nat Genet. 2023 May;55(5):871-879.
doi: 10.1038/s41588-023-01376-0. Epub 2023 Apr 27.

Single duplex DNA sequencing with CODEC detects mutations with high sensitivity

Jin H Bae et al. Nat Genet. 2023 May.

Abstract

Detecting mutations from single DNA molecules is crucial in many fields but challenging. Next-generation sequencing (NGS) affords tremendous throughput but cannot directly sequence double-stranded DNA molecules ('single duplexes') to discern the true mutations on both strands. Here we present Concatenating Original Duplex for Error Correction (CODEC), which confers single duplex resolution to NGS. CODEC affords 1,000-fold higher accuracy than NGS, using up to 100-fold fewer reads than duplex sequencing. CODEC revealed mutation frequencies of 2.72 × 10⁻⁸ in sperm of a 39-year-old individual, and somatic mutations acquired with age in blood cells. CODEC detected genome-wide, clonal hematopoiesis mutations from single DNA molecules, single mutated duplexes from tumor genomes and liquid biopsies, microsatellite instability with 10-fold greater sensitivity and mutational signatures, and specific tumor mutations with up to 100-fold fewer reads. CODEC enables more precise genetic testing and reveals biologically significant mutations, which are commonly obscured by NGS errors.

Conflict of interest statement

V.A.A., J.H.B., R.L. and G.M.M. have filed a patent application (PCT/US2021/062966) on this method. T.R.G. has paid scientific advisory roles and equity in Dewpoint Therapeutics and Anji Pharmaceuticals, holds founder’s equity in Sherlock Biosciences, is a paid advisor to Braidwell Inc. and has research funding from Bayer HealthCare, Calico Life Sciences and Novo Holdings. J.E.S. is the key opinion leader for ForTec Medical. The remaining authors declare no competing interests.

Figures

Fig. 1. Overview of CODEC.
a, To distinguish real mutations from damaged bases or polymerase errors, CODEC physically links Watson and Crick strands of each original duplex, which may have an alteration confined to one strand. Each cluster reads an NGS library molecule with sequences of both strands to trace a whole duplex, hence single duplex sequencing. b, CODEC uses the adapter quadruplex, which is prepackaged with all of the components needed for Illumina NGS, followed by strand displacing extension. Unlike standard NGS, CODEC can read outward to sequence a UMI, an index and an insert together. Each NGS read pair becomes self-sufficient for forming a consensus between two strands of an original duplex. c, CODEC is compatible with both targeted sequencing and WGS.
Fig. 2. Proof of concept.
a, Residual SNV frequencies of CODEC, duplex sequencing and other consensus methods such as paired-end reads consensus (R1 + R2) and SSC. Target enrichment with a pan-cancer gene panel was performed on cfDNA of two individuals. Duplex sequencing required at least two reads of each strand. b, CODEC residual SNV frequencies at each family size, which is the number of raw read pairs with the same UMI and start-stop positions. c, Recovery of unique original duplexes per targeted site in cfDNA of cancer patients and healthy donors against raw read pairs per target. d, Residual mutation frequencies and sequencing costs of different methods for WGS of the pilot genome NA12878 of the Genome in a Bottle Consortium. Because duplex sequencing WGS could not recover any duplex with the standard threshold, we had to relax it to one read of each strand only for this analysis. e, Residual SNV frequencies of WGS on human sperm with different end-repair/dA-tailing methods. a,b,d,e, Data points and error bars indicate mean values and 95% binomial confidence intervals by Wilson method, respectively.
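The error bars in this and later captions are described as 95% binomial confidence intervals by the Wilson method. As a minimal sketch of that calculation (the function name and the example counts are illustrative assumptions, not the authors' pipeline or data), the interval for an observed error count over a number of evaluated bases can be computed as follows:

```python
import math

def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion.

    successes: number of observed events (e.g. residual SNVs)
    trials:    number of opportunities (e.g. evaluated bases)
    z:         normal quantile; 1.959964 corresponds to a 95% interval
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate a very low error rate.
    low, high = wilson_interval(successes=3, trials=100_000_000)
    print(f"point estimate {3 / 100_000_000:.2e}, 95% CI [{low:.2e}, {high:.2e}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the observed count is zero or very small, which is the usual situation for residual error frequencies.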
Fig. 3. Detection of rare mutations enabled by high sequencing accuracy.
a, Standard NGS can only detect high-abundance mutations detected in multiple molecules but not low-abundance mutations that are obscured by background noise. CODEC can detect both high- and low-abundance mutations due to its single duplex resolution. b, FP and FN of CODEC and standard WGS on NA12878 when downsampled to 1× to 5× (0.6–3.0× correct product depth). c, Residual somatic SNVs detected from 6× CODEC (0.47–1.02× duplex depth) paired with duplex repair and 6× standard WGS on buffy coat DNA of 15 breast cancer patients. The fitted lines and R² values were from linear regression. d, VAF of the genome-wide mutations arising from CH discovered by CODEC (0.47–0.83× duplex depth) on cfDNA and independently validated by duplex sequencing in buffy coat. e, Mutation spectra of genome-wide CH from d. Each bar represents a TNC.
Fig. 4. Detection of somatic mutations in cancer genomes.
a, Mutations detected by 2× CODEC (0.11–0.14× duplex depth) and standard WGS on eight breast tumor samples were validated by 60× standard WGS + Mutect2. Mutations on a single read were accepted. b, Sensitivity for detecting clonal tumor mutations from CODEC WGS (3× duplex depth; 18× coverage) and corresponding PPV. The data were analyzed by requiring either ≥1 or ≥2 duplexes bearing the same mutation. Standard Illumina NGS at 18× coverage paired with Mutect2 (ref. ) achieved 98.6% sensitivity and 92.8% PPV. Theoretical numbers and projections (dashed) were calculated based on binomial models. c, Ratios and VAF of mutations initially found in CODEC WGS on tumor samples, cross-validated by targeted duplex sequencing on the same sample. Mutations exclusively detected by CODEC were grouped separately. d, Mutation spectra from 12× standard WGS with or without Mutect2 and 5× CODEC (2× duplex depth) data of a colon tumor with MSI. e, Cosine similarities against high-abundance mutations selected by Mutect2 from 12× standard WGS (orange box). Each method was downsampled to lower coverages. f, COSMIC signatures extracted from different categories of mutations. Categories under ‘discarded by Mutect2’ are subsets of corresponding categories under ‘All mutations’. g, Cosine similarities between 60× WGS + Mutect2 and either 2× CODEC or standard WGS on eight breast tumor samples. h, Pearson correlation between weights of HRD signature 3 estimated by 60× WGS + Mutect2 and either 2× CODEC or standard WGS. P values were calculated from simple linear regression with null hypothesis that slope is 0. The box shows the ground truth of HRD statuses determined by CHORD. i, Tumor mutations of four breast cancer patients tracked by performing hybridization capture on their cfDNA with personalized probe panels. Percentages indicate tumor fractions.
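Panels e and g compare mutation spectra by cosine similarity. The sketch below shows the standard definition applied to two count vectors. The six-category toy spectra are invented for illustration only; a real comparison would typically use one entry per trinucleotide context, and nothing here reproduces the authors' code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero count vectors."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of categories")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero spectrum")
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    # Hypothetical mutation counts per substitution class (C>A, C>G, C>T, T>A, T>C, T>G).
    reference_spectrum = [120, 40, 310, 35, 95, 30]
    low_coverage_spectrum = [11, 5, 29, 2, 10, 4]
    print(round(cosine_similarity(reference_spectrum, low_coverage_spectrum), 3))
```

Because the measure depends only on the direction of the vectors, a spectrum built from far fewer mutations can still score close to 1 against a deep reference as long as the relative proportions match.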
Fig. 5. Detection of MSI.
a, Summarized residual indel frequencies at mononucleotide microsatellites of NA12878. Arrows in the magnified box indicate overall fractions of reads with incorrect microsatellite lengths. b, Residual indel frequencies at mononucleotide microsatellites with different lengths from 8 to 18 nucleotides. c, Tumor and normal samples of a colon cancer patient with MSI were sequenced and diluted in silico to simulate MSI detection across different sample ratios. MSI score indicates a sum of probabilities of being an MSI site.
Extended Data Fig. 1. Design principle of CODEC adapter quadruplex.
(a) Predicted hybridization yield of the double-stranded regions with oligonucleotide concentrations of 500 nM at 20 °C and [Na+] = 10 mM. (b) The length of single-stranded linkers was determined to mitigate bending stiffness of a target duplex. Duplexes of up to 174 bp can be accommodated without any bending, as calculated from the lengths of DNA in the B-DNA helix and in single-stranded structure: approximately 0.33 nm per base pair along the helical axis of B-DNA and 0.64 nm per nucleotide for single-stranded DNA. We excluded 3 nucleotides from each single-stranded region, which is the minimum length of a hairpin loop. (c) Read primer binding sites of standard NGS and CODEC. (d) During Illumina cluster generation cycles, early termination in the middle of the insert region could create byproducts which turn into shorter fragments with only one insert. If a read primer binding site is located at the end of a fragment, unlike CODEC, these subclonal fragments cause mixed fluorescence after sequencing cycles pass the shared region, and consequently, low Quality Scores. (e) Mean Quality Scores of each sequencing cycle, taking 42 bp before and after the shared region from 100 random read pairs. Before redesigning the adapter structure, Quality Scores suddenly dropped after the shared region. This issue was solved by moving the read primer binding regions to the linker to ‘silence’ all byproducts without the linker. (f) UMIs and each set of four sample indices are designed to collectively include all four bases at each base position while keeping similar hybridization ∆G° for high-quality image analysis of Illumina sequencers. For example, Illumina software uses up to the first 25 bp for various purposes such as cluster identification, phasing correction, and chastity filter.
Extended Data Fig. 2. Byproduct analysis.
(a) Ratios of the correct CODEC product and byproducts which have been named after how they were likely created. (b) Expected mechanisms of byproduct formation. ‘Double ligation’ can occur when two adapter complexes are ligated to each end of an insert and go through T/T mismatched ligation with each other, as opposed to A/T ligation. ‘Blank ligation’ can occur when one or two adapter complexes go through T/T mismatched ligation with no insert. ‘Intermolecular’ can occur when polymerase extension uses another ligation product as a template instead of the opposite strand. (c) Cumulative fraction of sequencing coverage based on byproducts and the correct product reads. Their similarity implies that byproducts were randomly generated. (d) Medians of reads allocated to 300 bp windows grouped by their GC contents (top row) and their observed/expected ratios (bottom row). Shorter lengths of byproducts may have mitigated GC bias of polymerases. Center lines, boxes, and whiskers indicate medians, 25% and 75% percentiles, and 5% and 95% percentiles, respectively. (e) The ratio of GC-corrected read counts per 50 kb bin, normalized by the LOESS-fitted (by GC) chromosome-wide mean value. CODEC byproducts and standard NGS reads from NA12878 gDNA were analyzed by ichorCNA. CODEC byproducts showed lower normalized values than standard NGS, suggesting that there were no particular genomic regions with higher fractions of byproducts. (f) Correct product ratio and percentage of bases that passed all analysis filters vs. mean insert size of each library. Bases in byproducts were counted towards total bases, but not towards post-filtered bases.
Extended Data Fig. 3. Different types of strand consensuses.
(a) ‘No strand consensus’ treats each read of a read-pair as an independent read. ‘Single strand consensus’ is generated by collapsing multiple reads from the same strand of an original molecule, which cannot distinguish damaged bases from true mutations. ‘R1 + R2’ is a consensus between read 1 and 2 of paired-end sequencing, which both read the same library molecule from one strand of an original molecule. It does not suppress errors other than sequencing errors. (b) Residual SNV frequency per base context of targeted deep sequencing with the pan-cancer panel. Data points and error bars indicate mean values and 95% binomial confidence intervals by Wilson method, respectively.
Extended Data Fig. 4. Unique duplex recovery by CODEC and duplex sequencing.
(a) Mean unique duplex depth vs. raw read pairs per target after performing hybridization capture with personalized probe panels on cfDNA libraries of breast cancer patients and healthy donors. Samples were grouped by their input mass for library construction. (b) Mean unique duplex depths of cfDNA from four healthy donors which had the same input mass. We assumed that 20 ng input had 6000 haploid copies. For cfDNA samples, CODEC was expected to have lower molecular complexity because it removes longer molecules with double-size selection (Methods). Center lines, boxes, and whiskers indicate medians, 25% and 75% percentiles, and 5% and 95% percentiles, respectively. (c) The effect of relaxing the threshold of duplex sequencing from two reads of each strand to one read of each strand, indirectly observed with the same data as Fig. 2a,b. Schmitt et al. and Abascal et al. required three and two reads of each strand, respectively.
Extended Data Fig. 5. Details of CODEC WGS and WES.
(a) Cumulative fraction of sequencing coverage of CODEC and standard WGS with matching median coverage. The curves implied that the uniformity of CODEC WGS was not as high as that of standard WGS. (b) Medians of reads allocated to 300 bp windows grouped by their GC contents (top row) and their observed/expected ratios (bottom row). CODEC may have been affected more by polymerase’s GC bias due to its longer fragment length. Center lines, boxes, and whiskers indicate medians, 25% and 75% percentiles, and 5% and 95% percentiles, respectively. (c) Overall residual SNV frequencies and their base contexts of WES on a human gDNA sample. Data points and error bars indicate mean values and 95% binomial confidence intervals by Wilson method, respectively.
Extended Data Fig. 6. Details of suppressing errors at the end of DNA fragments before CODEC.
(a) Residual SNVs and their distances from fragment ends. This example shows NGS data of a healthy donor after hybridization capture with the pan-cancer panel. We discard mutations within the last 12 bp from either end. (b) Theoretical fragment size distributions after double digestion with blunting restriction enzymes. Covered percentages show how much of the human genome will turn into fragments with a size between 100 and 400 bp. The combination of HpyCH4V and AluI was selected for ddBTP-blocked ER/AT for Fig. 2e. (c) Residual SNV frequencies of CODEC paired with ddBTP-blocked ER/AT. Only using reads with a family size of one resulted in statistically the same SNV frequencies, confirming that a single read-pair is equally accurate. Data points and error bars indicate mean values and 95% binomial confidence intervals by Wilson method, respectively.
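The theoretical size distribution in panel b can be approximated by an in silico double digest. The sketch below is a rough illustration under stated assumptions, not the authors' procedure: it assumes AluI cuts AG^CT and HpyCH4V cuts TG^CA, both leaving blunt ends (check the enzyme supplier's documentation before relying on these sites), it treats the input as a single linear sequence, and the names `BLUNT_SITES`, `digest` and `covered_fraction` are hypothetical.

```python
# Assumed recognition sites mapped to the cut offset within each motif.
# AluI: AG^CT, HpyCH4V: TG^CA (both palindromic, blunt cutters).
BLUNT_SITES = {"AGCT": 2, "TGCA": 2}

def digest(seq, sites=BLUNT_SITES):
    """Return fragment lengths after cutting a linear sequence at every site."""
    seq = seq.upper()
    cuts = set()
    for motif, offset in sites.items():
        start = seq.find(motif)
        while start != -1:
            cuts.add(start + offset)
            # Advance by one so overlapping occurrences are also found.
            start = seq.find(motif, start + 1)
    bounds = [0] + sorted(c for c in cuts if 0 < c < len(seq)) + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]

def covered_fraction(seq, sites=BLUNT_SITES, min_len=100, max_len=400):
    """Fraction of bases that end up in fragments of min_len..max_len bp."""
    if not seq:
        raise ValueError("sequence must not be empty")
    lengths = digest(seq, sites)
    kept = sum(n for n in lengths if min_len <= n <= max_len)
    return kept / len(seq)

if __name__ == "__main__":
    # Toy sequence with one AluI site and one HpyCH4V site.
    toy = "AAAAAGCTCCCCTGCAGG"
    print(digest(toy))
    print(covered_fraction(toy, min_len=5, max_len=8))
```

Running the same functions over each chromosome of a reference assembly and summing the kept bases would give a genome-wide covered percentage comparable in spirit to the one reported in the panel.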
Extended Data Fig. 7. Cross-validation of the single-fragment mutations and their trinucleotide contexts.
(a) VAF of somatic mutations initially found in CODEC WGS on buffy coat DNA. The mutations were cross-validated by targeted deep sequencing using newly created duplex sequencing libraries from the same samples. Observed ratios show what fraction of the mutations were observed again in the independent libraries. Center lines, boxes, and whiskers indicate medians, 25% and 75% percentiles, and 5% and 95% percentiles, respectively. (b) Theoretical probability of the cross-validation based on the binomial distribution. Because sampling a rare mutation in a biological sample is stochastic, somatic mutations with lower VAF are less likely to be validated. Considering that most somatic mutations in buffy coat DNA are not under positive selection pressure and have low VAF, only a subset of mutations identified by CODEC WGS will be sampled again in the independent libraries. (c) To investigate whether CODEC contributed any new errors, we analyzed the trinucleotide error contexts of CODEC and duplex sequencing libraries from the same individuals in Fig. 3d. For duplex sequencing, hybridization capture data were used to acquire enough mutations for the analysis. After excluding mutations that were detected by Mutect2 or that had ≥2 duplex reads, CODEC and duplex sequencing had 3,992 and 204 mutations, respectively. One sample with high subclonality was also removed. The four highest peaks in each figure were cytosines in CpG contexts. The proportions of SBS1, which reflects spontaneous deamination of cytosines, were 24.2% and 19.7% for CODEC and duplex sequencing, respectively.
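The reasoning in panel b can be made concrete with a small binomial calculation. This is a sketch under simplifying assumptions that are not taken from the paper: each of `depth` independently sampled duplexes carries the mutation with probability equal to the VAF, and validation requires at least `min_mutant` mutant duplexes. The function name and the example values are hypothetical.

```python
from math import comb

def prob_revalidated(vaf, depth, min_mutant=1):
    """P(at least min_mutant of depth sampled duplexes carry the mutation)."""
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must be between 0 and 1")
    if depth < 0 or min_mutant < 0:
        raise ValueError("depth and min_mutant must be non-negative")
    # Sum the binomial probabilities of seeing fewer than min_mutant mutants,
    # then take the complement.
    below = sum(
        comb(depth, i) * vaf ** i * (1.0 - vaf) ** (depth - i)
        for i in range(min(min_mutant, depth + 1))
    )
    return 1.0 - below

if __name__ == "__main__":
    # Hypothetical validation depth of 1,000 unique duplexes per site.
    for vaf in (0.0001, 0.001, 0.01):
        print(f"VAF {vaf:.4f}: P(validated) = {prob_revalidated(vaf, 1000):.3f}")
```

The same function with `min_mutant=2` illustrates how a stricter evidence threshold lowers the chance of recovering a true low-VAF mutation, which is the trade-off behind requiring ≥1 versus ≥2 duplexes.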
Extended Data Fig. 8. Mutations detected in breast tumor samples.
(a) Mutations detected by CODEC and standard WGS at 2× on eight breast tumor samples were validated by 60× standard WGS + Mutect2. (b) Full COSMIC signatures of eight breast tumor samples.
Extended Data Fig. 9. Residual indel frequency.
Residual indel frequencies of CODEC and standard WGS on buffy coat DNA of 15 breast cancer patients.
Extended Data Fig. 10. CODEC with lower input masses.
Three different concentrations of adapter were tested for 0.1 and 0.01 ng input mass. We used a mixture of 67, 100 and 167 bp synthetic double-stranded DNA as the insert, which resulted in distinct sizes between the correct product and byproducts. To estimate the ratios of the correct product, we measured the concentration of each peak with a Bioanalyzer 2100 and High Sensitivity DNA Chip.
