. 2024 Jun;630(8017):752-761.

doi: 10.1038/s41586-024-07532-8. Epub 2024 Jun 12.

DNA mismatch and damage patterns revealed by single-molecule sequencing

Mei Hong Liu^#^{1

2}, Benjamin M Costa^#^{1

2}, Emilia C Bianchini^{1

2}, Una Choi^{1

2}, Rachel C Bandler¹, Emilie Lassen³, Marta Grońska-Pęski^{1

2}, Adam Schwing^{1

2}, Zachary R Murphy^{1

2}, Daniel Rosenkjær³, Shany Picciotto⁴, Vanessa Bianchi⁵, Lucie Stengs⁵, Melissa Edwards⁵, Nuno Miguel Nunes⁵, Caitlin A Loh^{1

2}, Tina K Truong^{1

2}, Randall E Brand⁶, Tomi Pastinen⁷, J Richard Wagner⁸, Anne-Bine Skytte³, Uri Tabori^{5

9}, Jonathan E Shoag⁴, Gilad D Evrony^{10

11}

Affiliations

¹ Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA.
² Department of Pediatrics, Department of Neuroscience & Physiology, Institute for Systems Genetics, Perlmutter Cancer Center, and Neuroscience Institute, New York University Grossman School of Medicine, New York, NY, USA.
³ Cryos International Sperm and Egg Bank, Aarhus, Denmark.
⁴ Department of Urology, University Hospitals Cleveland Medical Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA.
⁵ Program in Genetics and Genome Biology, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, Toronto, Ontario, Canada.
⁶ Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
⁷ Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA.
⁸ Department of Nuclear Medicine and Radiobiology, Université de Sherbrooke, Sherbrooke, Quebec, Canada.
⁹ Division of Haematology/Oncology, Arthur and Sonia Labatt Brain Tumour Research Centre, The Hospital for Sick Children, Toronto, Ontario, Canada.
¹⁰ Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA. gilad.evrony@nyulangone.org.
¹¹ Department of Pediatrics, Department of Neuroscience & Physiology, Institute for Systems Genetics, Perlmutter Cancer Center, and Neuroscience Institute, New York University Grossman School of Medicine, New York, NY, USA. gilad.evrony@nyulangone.org.

^# Contributed equally.

PMID: 38867045
PMCID: PMC11216816
DOI: 10.1038/s41586-024-07532-8

DNA mismatch and damage patterns revealed by single-molecule sequencing

Mei Hong Liu et al. Nature. 2024 Jun.

. 2024 Jun;630(8017):752-761.

doi: 10.1038/s41586-024-07532-8. Epub 2024 Jun 12.

Authors

Affiliations

¹ Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA.
² Department of Pediatrics, Department of Neuroscience & Physiology, Institute for Systems Genetics, Perlmutter Cancer Center, and Neuroscience Institute, New York University Grossman School of Medicine, New York, NY, USA.
³ Cryos International Sperm and Egg Bank, Aarhus, Denmark.
⁴ Department of Urology, University Hospitals Cleveland Medical Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA.
⁵ Program in Genetics and Genome Biology, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, Toronto, Ontario, Canada.
⁶ Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
⁷ Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA.
⁸ Department of Nuclear Medicine and Radiobiology, Université de Sherbrooke, Sherbrooke, Quebec, Canada.
⁹ Division of Haematology/Oncology, Arthur and Sonia Labatt Brain Tumour Research Centre, The Hospital for Sick Children, Toronto, Ontario, Canada.
¹⁰ Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA. gilad.evrony@nyulangone.org.
¹¹ Department of Pediatrics, Department of Neuroscience & Physiology, Institute for Systems Genetics, Perlmutter Cancer Center, and Neuroscience Institute, New York University Grossman School of Medicine, New York, NY, USA. gilad.evrony@nyulangone.org.

^# Contributed equally.

PMID: 38867045
PMCID: PMC11216816
DOI: 10.1038/s41586-024-07532-8

Abstract

Mutations accumulate in the genome of every cell of the body throughout life, causing cancer and other diseases^1,2. Most mutations begin as nucleotide mismatches or damage in one of the two strands of the DNA before becoming double-strand mutations if unrepaired or misrepaired^3,4. However, current DNA-sequencing technologies cannot accurately resolve these initial single-strand events. Here we develop a single-molecule, long-read sequencing method (Hairpin Duplex Enhanced Fidelity sequencing (HiDEF-seq)) that achieves single-molecule fidelity for base substitutions when present in either one or both DNA strands. HiDEF-seq also detects cytosine deamination-a common type of DNA damage-with single-molecule fidelity. We profiled 134 samples from diverse tissues, including from individuals with cancer predisposition syndromes, and derive from them single-strand mismatch and damage signatures. We find correspondences between these single-strand signatures and known double-strand mutational signatures, which resolves the identity of the initiating lesions. Tumours deficient in both mismatch repair and replicative polymerase proofreading show distinct single-strand mismatch patterns compared to samples that are deficient in only polymerase proofreading. We also define a single-strand damage signature for APOBEC3A. In the mitochondrial genome, our findings support a mutagenic mechanism occurring primarily during replication. As double-strand DNA mutations are only the end point of the mutation process, our approach to detect the initiating single-strand events at single-molecule resolution will enable studies of how mutations arise in a variety of contexts, especially in cancer and ageing.

PubMed Disclaimer

Conflict of interest statement

Competing Interests

A patent application for HiDEF-seq has been filed (G.D.E.). G.D.E. owns equity in DNA sequencing companies (Illumina, Oxford Nanopore Technologies, and Pacific Biosciences). Other authors do not have competing interests.

Figures

**Extended Data Figure 1.. HiDEF-seq library preparation and sequencing metrics.**
a, Representative DNA sizing electropherogram after Hpy166II restriction enzyme digestion (top) and after completion of the HiDEF-seq library preparation, which removes fragments < 1 kb (bottom). b, Two-dimensional histogram of all molecules from a representative HiDEF-seq sequencing run of each molecule’s longest strand read length (bp, base pairs) versus its total polymerase read length (PRL). Dashed line signifies the expected strand length distribution. The red diagonal line reflects 18% of molecules with < 1 strand pass, which is typical in PacBio sequencing. c, Histogram (200 bp bins) for representative HiDEF-seq samples (n=51) of molecule consensus sequence lengths (i.e., molecule sizes). Line and shaded region show average and standard deviation, respectively, across samples for each bin. The average of these samples’ median lengths is 1.7 kilobases (kb). d, Histogram as in panel (c), showing HiDEF-seq (n=51 representative samples) yields smaller molecule lengths than standard PacBio (HiFi) samples (n=10 samples). The average of samples’ median lengths are 1.7 kb and 18.3 kb for HiDEF-seq and HiFi, respectively. e, Two-dimensional histogram of the number of passes (bin width of 5 passes) vs. consensus sequence lengths (bin width of 200 bp) for molecules from the 51 representative HiDEF-seq samples plotted in panels (c,d). Bins are colored if there is at least one molecule in the bin. f, Box plots of the fraction of a molecule’s consensus sequence bases (average of forward and reverse strands) that have the maximum predicted quality (quality=93, as predicted by ccs, Methods) versus the number of passes per strand, across all molecules of the same samples included in panels (c-e). Note: 93 is the quality required for HiDEF-seq analysis. This plot illustrates that the number of passes is a key determinant of consensus quality in both HiDEF-seq and HiFi. b, Plot generated by SMRT Link (Pacific Biosciences) software. c-e, The single-molecule consensus sequence length is the average of the forward and reverse strand lengths. Bin values are normalized to the bin with the highest molecule count. e,f, The number of passes per strand is the average of the forward and reverse strand ‘ec’ tags (Methods). c-f, Plots show data of HiDEF-seq molecules that are output by the primary data processing step of the HiDEF-seq analysis pipeline and standard PacBio HiFi molecules that are output by the ccs HiFi pipeline (Methods). f, Box plot: middle line, median; boxes, 1st and 3rd quartiles; whiskers, the maximum/minimum values within 1.5 x interquartile range. X-axis: square brackets and parentheses signify inclusion and exclusion of interval endpoints, respectively.

**Extended Data Figure 2.. Schematic of analysis pipeline.**
Primary data processing (blue) is followed by call filtering (green) along with germline sequencing analysis (orange), which is then followed by call burden and signature analysis (purple). See Methods for full details. On the left of primary data processing steps are the average percentage of molecules filtered by each step across 17 representative HiDEF-seq sequencing runs. Approximately half of molecules filtered by the ‘Generate consensus sequence’ step are molecules with less than 3 full-length passes (default setting of the ccs tool that creates consensus sequences), and the other half are due to molecules with read quality (‘rq’ tag) < 0.99. At the end of the call filtering steps are listed the percentage of bases filtered by all the call filtering steps, calculated out of the total bases of molecules that pass primary data processing, for the same 17 representative HiDEF-seq sequencing runs. The filter for ‘low-quality genomic regions and gnomAD variants with allele frequency (AF) > 0.1% in the population’ covers approximately 15% and 7% of the genome when using Illumina and PacBio germline sequencing data, respectively (i.e., when PacBio germline sequencing data is used, the pipeline uses less restrictive filters due to fewer genome alignment errors and artifacts). WGS, whole-genome sequencing.

**Extended Data Figure 3.. Analysis thresholds and comparison of analyses using short- versus long-read germline sequencing.**
a, Histogram of predicted consensus sequence accuracy (‘rq’ tag, bin width=0.0001) for DNA molecules that pass primary data processing steps from 3 representative sperm samples profiled by HiDEF-seq (with nick ligation) (21yo: SPM-1002; 39yo: SPM-1004; 44yo: SPM-1020; yo, years old). Note, these are consensus sequence accuracies predicted by the ccs consensus calling software (Methods), which are used to filter low-quality molecules, but these accuracies do not reflect the true accuracy that is significantly higher. b, Box plot of passes per strand for different consensus sequence accuracy bins, for molecules from the 3 samples included in the prior panel, showing that higher minimum accuracies select for molecules with higher numbers of passes. c, Fraction of post-primary data processing molecules that are filtered (left plot) and fraction of post-primary data processing base pairs that remain for interrogation (right plot) using different minimum passes per strand and consensus sequence accuracy thresholds. Values show average of the 3 samples included in the prior panels, after completing all steps of the mutation filtering pipeline. d,e, dsDNA mutation burdens for the 3 samples included in the prior panels using different minimum passes per strand and consensus sequence accuracy thresholds. Panel (e) shows data from (d) at consensus accuracy of 0.99 with Poisson 95% confidence intervals. These data illustrate stability of dsDNA mutation burden estimates at broad thresholds using sperm as the most stringent test of fidelity. f, Fraction of high-quality, known heterozygous germline variants detected using different minimum required fraction of molecule passes (i.e., subreads) that detect the variant (filter applied separately to each strand). This value is used for sensitivity correction (Methods). Values show average of the 3 samples included in prior panels. g,h, dsDNA mutation burdens for the 3 samples included in the prior panels using different minimum required fraction of molecule passes that detect the variant (filter applied separately to each strand), after correcting for sensitivity (g), and using different minimum required distances from the end of the read (h). Panel (g) illustrates that correcting for sensitivity maintains stable burden estimates. The analysis pipeline requires a minimum of 10 bp from the ends of reads to remove rare alignment artifacts, although this does not significantly alter burden estimates. i, ssDNA call burdens for the 3 sperm samples included in the prior panels using different minimum passes per strand and consensus sequence accuracy thresholds. Plot shows a small decrease in ssDNA call burdens with a higher minimum required passes per strand at low consensus sequence accuracy thresholds, and convergence to similar burdens at high consensus sequence accuracy thresholds. Data shown with minimum fraction of 0.5 molecule passes that detect the variant. j, ssDNA call burdens for the 3 sperm samples included in the prior panels using different minimum required fraction of molecule passes that detect the variant, after correcting for sensitivity. Data shown with a minimum consensus sequence accuracy of 0.999 and a minimum of 20 passes per strand. k,l, Concordant dsDNA mutation and ssDNA call burdens obtained by HiDEF-seq using short-read (Illumina) or long-read (PacBio, Pacific Biosciences) germline sequencing during analysis, for two samples (1301 and 1901 blood). **a-d**,i, Consensus sequence accuracies are the average of forward and reverse strand accuracies. b, Box plot: middle line, median; boxes, 1st and 3rd quartiles; whiskers, the maximum/minimum values within 1.5 x interquartile range. X-axis: square brackets and parentheses signify inclusion and exclusion of interval endpoints, respectively. **c-e**,i, Threshold for minimum required passes per strand is applied to both strands. **c-j**, The symbols ‡ and § mark the final thresholds chosen for dsDNA and ssDNA analyses, respectively. c,f, Error bars: standard deviation; note, panel (f) error bars are small and therefore not well visualized. **d,e,g-l**, mutation and call burdens are corrected for sensitivity and trinucleotide context opportunities of the full genome relative to interrogated bases (Methods). e,**g,h**,j-l, Dots and error bars: point estimates and their Poisson 95% confidence intervals.

**Extended Data Figure 4.. dsDNA mutation burdens of HiDEF-seq without ssDNA nick ligation and removal of ssDNA artifacts by ssDNA nick ligation.**
a, dsDNA mutation burdens in two sperm samples (left to right: SPM-1004, SPM-1020) profiled by HiDEF-seq without ssDNA nick ligation and by NanoSeq, compared for each age (yo, years old) to paternally-phased de novo mutations in children from a prior study of 2,976 trios. See Fig. 1c for sperm samples profiled by HiDEF-seq with nick ligation. b, dsDNA mutation burdens versus age, measured by HiDEF-seq without nick ligation (see Fig. 1d for samples profiled by HiDEF-seq with nick ligation). Dashed lines (liver, kidney): weighted least-squares linear regression. Dotted lines (blood, neurons): these only connect two data points to aid visualization of burden difference, since regression cannot be performed with two samples. c, Mutational signature contribution to dsDNA mutations detected in samples profiled by HiDEF-seq without nick ligation (see Extended Data Fig. 5i for samples profiled by HiDEF-seq with nick ligation). All samples, except blood from a 62-year-old individual with a history of kidney disease (1901, asterisk), were jointly analyzed with fitting of SBS1 and de novo extraction of one additional signature SBSi (Methods). The blood sample of the 62-year-old was analyzed separately together with 5 other HiDEF-seq (with nick ligation) blood samples from this individual, due to identification of an additional signature SBSii with strong and moderate similarity to SBS19 and SBS23, respectively. Analysis of samples grouped by tissue type, excluding the 62-year-old blood sample, produced similar results. For de novo extracted signatures (SBSi and SBSii), the cosine similarities to the most similar COSMIC signatures are shown in parentheses. Sperm samples and kidney and liver samples from an infant (1443) were not included here since the number of mutations is too low for reliable signature extraction. d, Burdens of dsDNA mutations (left plot) and ssDNA calls (right plot) of a blood sample (individual 1301) measured by HiDEF-seq without versus with nick ligation. Nick ligation eliminates T>A ssDNA artifacts that match the illustrated GTTBVH motif. The motif was derived using the ggseqlogo R package using all ssDNA T>A calls from the sample profiled by HiDEF-seq without nick ligation. Gray bar is calls matching the motif with log-odds score > 2 calculated with the score_match function of the universalmotif R package. e,f, Proposed mechanism for the GTTBVH motif of ssDNA artifactual calls in HiDEF-seq without ssDNA nick ligation. The known GTNNAC motif of the Hpy166II restriction enzyme used in HiDEF-seq may arise if Hpy166II operates as a dimer (cut sites signified by triangles) with each monomer binding opposite strands, and the GTTBVH motif is due to intersection (∩) and union (∪) combinatorial logic for the outer and inner 2 bases, respectively (e). Without nick ligation, ssDNA GT[T>A]BVH artifactual calls may arise from rare Hpy166II monomer nicking events, pyrophosphorolysis of the ‘T’ upstream of the nick, and addition of a mismatched ‘A’ during the Klenow dATP/ddBTP A-tailing reaction. Further extension with ddBTP does not occur due to the mismatch. This process is prevented in HiDEF-seq by nick ligation. g, Nick ligation increases HiDEF-seq library yield by 66% for post-mortem tissues, likely by repairing nicks in the original input DNA so that the molecules are not eliminated in the final nuclease treatment step. Bars show average yield for each group; number of samples per group (left to right): 8, 8, 5, 9 (**, p=0.002; ns, not significant; two-sided unpaired t-test). a, Box plots: middle line, median; boxes, 1st and 3rd quartiles; whiskers, 5% and 95% quantiles. For each sample, HiDEF-seq and NanoSeq confidence intervals were normalized to reflect an equivalent number of interrogated base pairs (Methods). a,b,d, Error bars: Poisson 95% confidence intervals. g, Error bars: standard deviation.

**Extended Data Figure 5.. HiDEF-seq without A-tailing removes ssDNA artifacts of post-mortem tissues with fragmented DNA.**
a, Fraction of ssDNA calls that are T>A (corrected for trinucleotide context opportunities) versus the ssDNA T>A burden in all samples profiled by HiDEF-seq with A-tailing (i.e., Klenow reaction +dATP/+ddBTP) from healthy individuals and cell lines (i.e., excluding cancer predisposition syndromes). Post-mortem kidney and liver consistently have the highest fraction of ssDNA calls that are T>A. b, ssDNA call spectrum for a liver sample profiled by HiDEF-seq with A-tailing exhibiting a high ssDNA T>A burden (6.8·10⁻⁷ T>A burden; 7.6·10⁻⁷ total ssDNA call burden), corrected for trinucleotide context opportunities. Parentheses show total number of calls. c, Correlation between ssDNA T>A artifact burden and the input DNA’s DNA Integrity Number measured by TapeStation electrophoresis across all samples profiled by HiDEF-seq with A-tailing from healthy individuals and cell lines (i.e., excluding cancer predisposition syndromes). Lower DNA Integrity Number corresponds to more fragmented DNA. d, Proposed mechanism for the ssDNA T>A artifact calls in fragmented DNA when performing HiDEF-seq with A-tailing and its prevention in HiDEF-seq without A-tailing. e, Modifications of the HiDEF-seq protocol to eliminate ssDNA T>A artifacts in fragmented DNA. All trials were from the same DNA extraction aliquot (liver from individual 5697). See Methods for details. PNK, polynucleotide kinase; Bst, Bst large fragment; min, minutes. f, ssDNA call spectra for three of the samples shown in panel (e): standard HiDEF-seq with A-tailing (top, same spectrum as panel (b)), HiDEF-seq with a Klenow reaction that does not contain dATP nor ddBTP (middle), and HiDEF-seq with a Klenow reaction containing only ddBTP (bottom). The total number of ssDNA calls and total ssDNA call burden (calls per base) are shown. g, Fraction of ssDNA calls that are T>A (corrected for trinucleotide context opportunities) versus the ssDNA T>A burden in post-mortem liver (n=5) and kidney (n=5) samples profiled by HiDEF-seq without A-tailing (i.e., Klenow reaction −dATP/+ddBTP). h, Concordant dsDNA mutation burdens in sperm sample SPM-1013 measured by HiDEF-seq with A-tailing (i.e., Klenow reaction +dATP/+ddBTP) and without A-tailing (i.e., Klenow reaction -dATP/+ddBTP). yo, years old. i, Mutational signature contribution to dsDNA mutations detected by HiDEF-seq in primary human tissues from individuals without cancer predisposition. Post-mortem liver and kidney samples were profiled by HiDEF-seq without A-tailing. All samples, except blood from a 62-year-old individual with a history of kidney disease (1901, asterisk), were jointly analyzed with fitting of SBS1 and de novo extraction of one additional signature SBSiii. Blood samples of the 62-year-old profiled by HiDEF-seq were analyzed separately (plot shows average signature contributions across 5 blood samples) due to identification of an additional signature SBSiv. Analysis of samples grouped by tissue type, excluding the 62-year-old blood sample, produced similar results. For de novo extracted signatures (SBSiii and SBSiv), the cosine similarities to the most similar COSMIC signatures are shown in parentheses. Sperm, kidney and liver samples from an infant (1443) and 18-year-old (1409), and blood from a 4-year-old (5203) are not included here since their number of mutations are too low for reliable signature extraction. **e,h**, Bars (e) and dots (h) show point estimates, and error bars are their Poisson 95% confidence intervals. **e-g**, Rxn, reaction.

**Extended Data Figure 6.. Comparison of HiDEF-seq and NanoSeq.**
a, Comparison of HiDEF-seq versus NanoSeq dsDNA mutation spectra for individual 63143. b, Comparison of HiDEF-seq versus NanoSeq ssDNA call burdens, separated by call type. For each call type (i.e., C>A, C>G, etc.), each bar represents a different sample. Samples for each call type, from left to right, are 1105 and 6501 for healthy blood; 63143 for POLE blood; and 1443 for kidney. Comparison for sperm samples is shown in Fig. 1g. c, Comparison of HiDEF-seq versus NanoSeq ssDNA call spectra for 6501 (Blood, 43 yo), 63143 (POLE blood), and SPM-1060 (sperm, 49 yo). **a-c**, yo, years old; mo, months old.

**Extended Data Figure 7.. dsDNA mutation burdens and patterns in cancer predisposition syndromes.**
a, Fraction of dsDNA mutations in each context. Non-cancer predisposition samples are (left to right): Blood (B) 5203, 1105, 1301, 6501, and 1901; lymphoblastoid cell line (LCL) GM12812; primary fibroblasts GM02036 and GM03348. Cancer predisposition samples are (left-to-right, in the same order and annotated sample types as top-to-bottom cancer predisposition samples in panel (c)): GM16381, GM01629, GM28257, 55838, 58801, 57627, 1400, 1324, 1325, 60603, 59637, 57615, 63143 (LCL), 63143 (B), CC-346–253, CC-388–290, CC-713–555. Affected genes annotated below. Note, GM02036 (asterisk) has a significant increase in C>T mutations with a spectrum matching COSMIC SBS7a (ultraviolet light exposure), likely due to the fibroblasts deriving from sun-exposed skin. b, Representative dsDNA mutation spectra of a sample for each affected gene, corrected for trinucleotide context opportunities. Sample IDs are in parentheses. Ages (yo, years old) are listed for blood samples. c, Fraction of dsDNA mutations attributable to de novo extracted dsDNA mutational signatures. Sample genotypes are on the right (hom., homozygous; compound heterozygous variants separated by ‘/’). In parentheses is the cosine similarity to the most similar COSMIC signature when the similarity is ≥ 0.8 (weak similarity: 0.8 – 0.85; moderate similarity: 0.85 – 0.9; strong similarity: ≥ 0.9; Methods). In *ERCC6* and *ERCC8* mutant cell lines, whose mutational patterns are unknown, we identified signature SBSB with weak similarity (cosine similarity 0.82) to the COSMIC SBS36 signature. For SBSF, the most similar COSMIC signature is SBS10c, but the cosine similarity of 0.79 is not considered significant. For SBSG, the most similar COSMIC signature is SBS40, but the cosine similarity of 0.76 is not considered significant. SBSG had non-significant similarities to SBS18 (0.69) and SBS36 (0.59), which have been previously associated with *MUTYH*. These *MUTYH* signatures were not extracted due to the normal mutation burdens of our *MUTYH* blood samples (see panel (d)), which is expected at these sample ages and our interrogated base coverage. Note that SBS40 resembles SBS18 and SBS36 in the C>A spectrum that is enriched in *MUTYH* syndrome. Signature extraction was performed for samples of each DNA repair pathway (except *XPC* separately from *ERCC6*/*ERCC8*), while simultaneously fitting COSMIC SBS1 and SBS5 (Methods). Samples are in the same top-to-bottom order as left-to-right cancer predisposition samples in panel (a). d, dsDNA mutation burden per base pair divided by the age of the individual in years at the time of blood collection, corrected for trinucleotide context opportunities and sensitivity. Only blood samples are shown, since blood can be annotated with the age of the individual. Accordingly, since we did not profile blood samples nucleotide excision repair syndrome, this category is not shown. Non-cancer predisposition blood samples are the same (left-to-right) as in panel (a) (left-to-right). Cancer predisposition blood samples are the same (left-to-right) as blood samples in panel (c) (top-to-bottom). Affected genes annotated below. e, Replication strand asymmetry based on replication timing data (Methods) of AGA>ATA ssDNA mismatches and dsDNA mutations in *POLE* PPAP samples. Reference (+) refers to the human reference genome plus strand. Non-reference (−) strand lagging and leading strand synthesis corresponds to negative and positive fork polarity values, respectively (Methods). The ‘strand ratio’ (Y-axis) is calculated as the fraction of all AGA>ATA non-reference strand events that have the specified fork polarity divided by the fraction of all AGA>ATA reference strand mutations that have the specified fork polarity (Methods). *, p = 0.015; ***, p < 10⁻¹⁵ (chi-squared test; n = 73 ssDNA AGA>ATA mismatches; n = 3,871 dsDNA AGA>ATA mutations). For dsDNA mutations, bars show the average across PPAP samples (n=4), and for ssDNA mismatches, due to their low number, bars show a single estimate for calls pooled across PPAP samples. See (f) for analysis of dsDNA mutations separated by fork polarity quantiles (rather than positive versus negative polarity), which cannot be plotted for ssDNA mismatches due to the low number of ssDNA mismatches per quantile. ssDNA strand ratios were calculated using calls of all *POLE* PPAP samples, since there are too few calls to reliably analyze individual samples. dsDNA strand ratios were calculated separately for each sample (plot shows average and standard deviation). Excluding calls overlapping genes to exclude transcription strand biases was still significant for dsDNA mutations (p < 10⁻¹⁵) but not ssDNA mismatches, but the latter had significantly reduced power due to a 55% reduction in the number of analyzed ssDNA calls. f, Replication strand asymmetry of AGA>ATA dsDNA mutations in *POLE* PPAP samples calculated for each fork polarity quantile. Fork polarity quantiles divide fork polarity values into 9 quantile bins from 0 to 1, with higher values corresponding to a greater probability of the non-reference strand being replicated in the leading rather than lagging strand direction (Methods). Random loci are the average of 50 sets of 1,000 random genomic loci with either the sequence AGA or TCT for which there is replication timing data at the locus. The ‘strand ratio’ is calculated for *POLE* PPAP samples as in (e), and it is calculated for random genomic loci as the fraction of all AGA non-reference strand loci that are in the fork polarity quantile bin divided by the fraction of all AGA reference strand loci that are in the fork polarity quantile bin. PPAP samples are the same top-to-bottom order in the legend as top-to-bottom PPAP samples in (c). Asterisks signify statistical significance in comparison of the *POLE* PPAP 4-sample average (dashed line) to random loci (heteroscedastic two-tailed t.test); p-values left-to-right for asterisks: 3.7·10^–17, 0.001, 0.009, 0.02, 0.003. Excluding mutations overlapping genes to exclude transcription strand biases produced similar results (p=3.1·10⁻¹⁰, 0.003, and 0.04 for quantiles 0–0.1, 0.1–0.2, and 0.6–0.7, respectively), but this analysis has reduced power due to the 55% reduction in the number of mutations. **a-f**, See additional samples details in Supplementary Tables 1–4. **e,f**, Error bars: standard deviation.

**Extended Data Figure 8.. Hypermutating tumors deficient in both mismatch repair and polymerase proofreading.**
a, Burdens of dsDNA mutations (left) and ssDNA calls (right). Burdens are corrected for trinucleotide context opportunities and detection sensitivity (Methods). **b,c**, Fraction of dsDNA mutation burdens (b) and ssDNA call burdens (c) by context, corrected for trinucleotide context opportunities. d, ssDNA mismatch signature SBS14ss extracted from tumor samples, while simultaneously fitting SBS30ss*. e, Fraction of dsDNA mutations attributed to each dsDNA signature. Cosine similarity of the extracted signature SBSH to the most similar COSMIC SBS signature is shown in parentheses. Cosine similarities of original spectra of samples to spectra reconstructed from component signatures are (left to right): 0.94 and 0.998. f, Fraction of ssDNA calls attributed to each ssDNA signature. Cosine similarities of original spectra of samples to spectra reconstructed from component signatures are (left to right): 0.91 and 0.98. a, Dots and error bars: point estimates and their Poisson 95% confidence intervals. **a-c,e,f**, MB, medulloblastoma (ID: Tumor 8); GBM, glioblastoma (ID: Tumor 10). See Supplementary Table 1 for sample details.

**Extended Data Figure 9.. Burdens of ssDNA C>T calls, kinetic interpulse duration profiles, and profiling of heat treatment in varied buffers.**
a, Fraction of ssDNA calls that are C>T (corrected for trinucleotide context opportunities) across all HiDEF-seq samples from healthy individuals and cell lines (i.e., excluding cancer predisposition syndromes), versus the ssDNA C>T burden. Data shown for liver and kidney samples profiled by HiDEF-seq without A-tailing. Sperm consistently have the highest fraction of ssDNA calls that are C>T. LCL, lymphoblastoid cell line. b, Cosine similarity of ssDNA call spectra to SBS30 after projecting ssDNA spectra to central pyrimidine contexts. c, Average ratio of pulse widths (left) and interpulse durations (right) at C>T calls and 30 flanking bases relative to molecules aligning to the same locus without the call (sperm: n=1799 calls; blood DNA 72C heat, 3 and 6 hours: n=626 calls; dsDNA C>T mutations in a larger set of non-heat treated blood DNA, 56C and 72C heat treated blood DNA, sperm, kidney, and liver samples: n=1202 mutations; Methods). Positions +1 and +3 (stars) best discriminate ssDNA C>T damage from dsDNA C>T mutations. Yellow box is the span shown in Fig. 4f. d, Average ratio of pulse width (left column) and interpulse duration (right column) after randomizing labels of molecules with and without the calls, for the same samples and calls as in Extended Data Fig. 9c. e, dsDNA mutation and ssDNA call burdens of heat-treated blood DNA in an additional experiment testing the effect of different buffers and different DNA extraction methods (orange underline, Puregene alcohol precipitation; all other samples, MagAttract with magnetic beads). MgAc, magnesium acetate; MgCl₂, magnesium chloride; KCl, potassium chloride; KAc, potassium acetate; Alb, albumin; Tris buffer is Tris-HCl except for the MgAc/KAc/Alb that is Tris-Acetate (see Supplementary Table 1 for concentrations). Non-heat treated DNA samples were placed on ice for 6 hours. The percentage of ssDNA sequencing calls that are C>T are annotated above each sample. Cosine similarity to COSMIC dsDNA signature SBS30 is annotated below each sample, after collapsing ssDNA calls to central pyrimidine trinucleotide contexts and correcting for trinucleotide context opportunities, except for the no-heat treatment samples that do not have sufficient C>T calls (‘N/A’). f, SBS30ss* signature (reproduced from Fig. 4d) compared to spectra of ssDNA calls after 72 °C heat damage of blood DNA for 6 hours (h) in only 10 mM Tris buffer (n=10,852 calls) or only water (n=2,751 calls). Spectra are plotted after correcting for trinucleotide context opportunities. Bottom, odds ratios of spectrum contributions at C>T contexts of the Tris-only and water-only samples compared to SBS30ss* (which was derived from sperm and salt-buffer heat-treated samples). Pyr, pyrimidine, Pur, purine. g, Heat map of average pulse width ratios for ssDNA and dsDNA C>T calls for positions −1 to +6, for blood DNA samples heated at 72 °C for 6 hours in different buffers or water, and for additional samples for comparison. Unbiased clustering (dendrogram) separates kinetic profiles of ssDNA C>T calls from dsDNA C>T calls and from kinetic profiles after randomizing labels of molecules with and without the calls. dsDNA ‘Blood, heat’: blood DNA heat-treated at 56 °C and 72 °C (both 3h and 6h for each); dsDNA ‘Blood’: 4 samples, not heat treated. dsDNA ‘Kidney and liver’: 10 samples, not heat treated. b, HiDEF-seq spectra are corrected for trinucleotide context opportunities. **c,d**, Error bars: standard error of the mean. e, Bars and error bars: point estimates and their Poisson 95% confidence intervals.

**Extended Data Figure 10.. APOBEC3A-induced dsDNA and ssDNA call burdens and patterns.**
**a,b**, Burdens (corrected for trinucleotide context opportunities and sensitivity) of dsDNA mutations (a) and ssDNA calls (b) in fibroblasts transduced with lentivirus-expressing GFP (green fluorescent protein) as a control or APOBEC3A with or without a nuclear localization signal (NLS). Two biological replicates are shown for each condition. c, Spectra of dsDNA mutations corrected for trinucleotide context opportunities. d, Fraction of dsDNA mutations attributed to each dsDNA signature. Cosine similarity of the de novo extracted signature SBSI to the most similar COSMIC SBS signature is shown in parentheses. Cosine similarities of original spectra of samples to spectra reconstructed from component signatures are (left to right): 0.99, 0.98, 0.98, and 0.97. e, Spectra of ssDNA calls corrected for trinucleotide context opportunities. f, SBS2ss* obtained by de novo signature extraction from APOBEC3A samples. Cosine similarity to SBS2 is calculated after projecting to central pyrimidine trinucleotide context. **a,b**, Error bars: Poisson 95% confidence intervals.

**Extended Data Figure 11.. Mitochondrial genome dsDNA mutation rates, similarity between SBS30ss* and mitochondrial genome heavy strand A>G dsDNA mutations, and mitochondrial ssDNA call spectra.**
a, Mitochondrial dsDNA mutation burdens versus age in liver and kidney samples, including liver samples from which mitochondria were enriched. Dashed lines: weighted least-squares linear regression. Shaded ribbon: 95% confidence interval. b, SBS30ss* (cytosine deamination) spectrum is projected to central pyrimidine trinucleotide contexts and compared to mitochondria heavy strand A>G dsDNA mutation spectrum (corrected for trinucleotide context opportunities), for different sample sets: (i) HiDEF-seq liver and kidney samples, including liver samples from which mitochondria were enriched (i.e., same set of samples in Figs. 5a,c and Extended Data Fig. 11a); (ii) 5697 purified liver mitochondria samples only (plot includes 89% of the mutations in (i)); (iii) Sample set (i), excluding the 5697 purified liver mitochondria samples (plot includes 11% of the mutations in (i)). Note, the contexts of SBS30ss* are matched with the reverse complement flanking base contexts of mitochondria heavy strand A>G mutations. The number of dsDNA A>G mutations is indicated. c, Spectrum of mitochondrial ssDNA calls combined from the liver and kidney samples shown in Figs. 5a,c and Extended Data Fig. 11a. The spectrum is corrected for trinucleotide context opportunities, separately for each strand. See Fig. 5d for a spectrum that includes bulk (i.e., non-mitochondria enriched) samples profiled by HiDEF-seq with A-tailing. a, Dots and error bars: point estimates and their Poisson 95% confidence intervals.

**Figure 1.. HiDEF-seq overview.**
a, HiDEF-seq schematic. A-tailing uses dATP and non-A dideoxynucleotides, except for lower-quality post-mortem samples that use only non-A dideoxynucleotides to avoid dATP misincorporation at residual nicks (Extended Data Fig. 5 and Methods). b, Average fraction of molecules across representative HiDEF-seq samples (n=51) and standard PacBio sequencing (HiFi) samples (n=10) in different bins of number of passes per strand (Methods). The average percentage of molecules with ≥ 5 and ≥ 20 passes per strand is 99.8% and 70% for HiDEF-seq, respectively, and 78.7% and 0.1% for HiFi, respectively. Plot shows HiDEF-seq molecules output by the pipeline’s primary data processing step. X-axis brackets and parentheses signify inclusion and exclusion of bin endpoints, respectively. c, HiDEF-seq and NanoSeq dsDNA mutation burdens in sperm samples (left to right: SPM-1013, SPM-1002, SPM-1004, SPM-1020, SPM-1060) compared for each age to paternally-phased de novo mutations from a prior study. d, HiDEF-seq dsDNA mutation burdens in human tissues (Supplementary Table 1). Dashed lines: weighted least-squares linear regression. **e,f**, HiDEF-seq versus NanoSeq dsDNA mutation burdens (e) and ssDNA call burdens (f). Samples (top to bottom in legend): SPM-1013, SPM-1002, SPM-1004, SPM-1020, SPM-1060, 1443, 1105, 6501, 63143. Only sample 63143 (POLE p.M444K) is from an individual with a cancer predisposition syndrome. Dashed line, y = x (expectation for concordance). g, HiDEF-seq versus NanoSeq ssDNA call burdens separated by call type. For each call type, each bar represents a different sperm sample (left-to-right, same samples as in (c)). b, Error bars: standard deviation. **c-f**, Dots and error bars: point estimates and their Poisson 95% confidence intervals. c, Box plots: middle line, median; boxes, 1st and 3rd quartiles; whiskers, 5% and 95% quantiles. **c,e,f**, For each sample, HiDEF-seq and NanoSeq confidence intervals were normalized to reflect an equivalent number of interrogated base pairs (c,e) or bases (f) (Methods). **c,e-g**, yo, years old; mo, months old.

**Figure 2.. ssDNA call burdens and patterns in cancer predisposition syndromes.**
a, ssDNA call burdens in blood (B), fibroblasts (F), and lymphoblastoid cell lines (L) from individuals without and with cancer predisposition syndromes. Burdens are corrected for trinucleotide context opportunities and detection sensitivity (Methods). ***, p=2·10⁻¹⁰ for mismatch repair and p<10⁻¹⁵ for polymerase proofreading, versus non-cancer predisposition samples (two-sided Poisson rates ratio test, combining calls and interrogated bases from each group, with Holm multiple comparisons adjustment). Results were also significant when including only blood samples. Samples, left-to-right: 5203, 1105, 1301, 6501, 1901, GM12812, GM02036, GM03348, GM16381, GM01629, GM28257, 55838, 58801, 57627, 1400, 1324, 1325, 60603, 59637, 57615, 63143 (L), 63143 (B), CC-346–253, CC-388–290, CC-713–555. Cancer predisposition samples are ordered as in (b) that lists affected genes. b, ssDNA call burdens by context, corrected for trinucleotide context opportunities. *, p=0.03; ***, p=0.0008 (heteroscedastic two-tailed t-test, adjusted for multiple comparisons). Only non-cancer predisposition samples with > 30 ssDNA calls are included (1105, 1301, 1901, GM12812, GM03348), since patterns are not reliably ascertained with fewer calls. However, GM16381 (XPC) with < 30 calls is included for completeness in showing all cancer predisposition samples. **c,d**, Spectra of ssDNA calls (c) and dsDNA mutations (d) for representative *POLE* PPAP sample 57615, corrected for trinucleotide context opportunities. e, Top, ssDNA mismatch signature SBS10ss extracted from all PPAP samples while simultaneously fitting SBS30ss* (Fig. 4d). Middle, SBS10ss projected to central pyrimidine contexts by summing central pyrimidine values and their reverse complement central purine values to allow comparison to dsDNA signatures. Bottom, dsDNA mutational signature (sum of SBSE and SBSF) extracted from PPAP samples. f, Fraction of ssDNA calls attributed to ssDNA signatures in PPAP samples (same PPAP sample order as (a)). Cosine similarities of original spectra to spectra reconstructed from signatures (left-to-right): 0.94, 0.97, 0.97, 0.85. **a,b**, See Supplementary Tables 1,2 for sample details. a, Error bars: Poisson 95% confidence intervals.

**Figure 3.. Hypermutating tumors deficient in both mismatch repair and polymerase proofreading.**
Spectra of ssDNA calls (top) and dsDNA mutations (bottom) in tumor samples corrected for trinucleotide context opportunities. Parentheses show the total number of raw calls and the percentage of calls that are C>T after correction for trinucleotide context opportunities. Blue annotation on the top right of each ssDNA spectrum is the cosine similarity of only the ssDNA C>T calls to SBS30ss* (see Fig. 4d for details of SBS30ss*). Also annotated are the cosine similarities of each sample’s full ssDNA call spectrum (projected to central pyrimidine context) to its dsDNA mutation spectrum, for all ssDNA calls and after excluding ssDNA C>T calls (most of which are due to SBS30ss* cytosine deamination). Medulloblastoma ID: Tumor 8; Glioblastoma ID: Tumor 10. See Supplementary Table 1 for sample details.

**Figure 4.. ssDNA damage signatures of sperm and heat-treated DNA.**
a, Spectrum of all ssDNA calls of non-cancer predisposition blood samples (one sample each from individuals 1105, 1301, 5203, and 6501, and five samples from individual 1901). Cosine similarity to COSMIC SBS30 is calculated after projecting the ssDNA spectrum to central pyrimidine contexts. b, dsDNA mutation and ssDNA call burdens of heat-treated DNA. c, ssDNA call spectra of representative sperm and heat-treated blood DNA samples, and SBS30 for comparison. d, SBS30ss* obtained by de novo signature extraction from central pyrimidine ssDNA calls of sperm and heat-treated samples. Cosine similarity to SBS30 is calculated after projecting to central pyrimidine contexts. e, Schematic of pulse width (PW) and interpulse duration (IPD) measured for incorporated bases during sequencing. f, Heat map of average pulse width ratios for positions −1 to +6, which is the polymerase footprint harboring a kinetic signal that differs from the flanking baseline. Unbiased hierarchical clustering (dendrogram) separates ssDNA C>T calls from dsDNA C>T mutations and from kinetic profiles with randomized molecule labels. Positions +1 and +3 (stars) best discriminate ssDNA C>T damage from dsDNA C>T mutations. dsDNA ‘Blood, heat’: samples heat-treated at 56C and 72C (both 3h and 6h for each; h, hours); dsDNA ‘Blood’: 4 samples. dsDNA ‘Kidney and liver’: 10 samples. **a,c,d**, HiDEF-seq spectra are corrected for trinucleotide context opportunities. b, Bars and error bars: point estimates and their Poisson 95% confidence intervals. P-values, left-to-right (Poisson rates ratio test, two-sided): 0.001, 0.35 (ns, not significant), < 10⁻¹⁵, < 10⁻¹⁵.

**Figure 5.. Mitochondrial genome dsDNA and ssDNA call burdens and patterns.**
a, Nuclear versus mitochondrial genome dsDNA mutation rates. Mitochondrial rates are from the regressions in Extended Data Fig. 11a, which were similarly performed for the nuclear genome and for liver and kidney samples combined. P-values ANOVA comparing two weighted least-squares linear regression models of mutation burden versus age and genome type covariates: one with and one without an ‘age x genome type’ interaction term (an estimate of the difference in dsDNA mutation rate depending on whether it is the nuclear or mitochondrial genome). b, ssDNA call burdens in the nuclear versus mitochondrial genomes after combining the calls of liver and kidney samples shown in Extended Data Fig. 11a, excluding from the nuclear genome burden the liver samples from which mitochondria were enriched, since due to low DNA inputs these samples were profiled with HiDEF-seq with A-tailing that induces ssDNA T>A artifacts in the nuclear genome of post-mortem liver. P-value, Poisson rates ratio test, two-sided. c, dsDNA mutation spectrum, corrected for trinucleotide context opportunities, of the liver and kidney samples shown in Extended Data Fig. 11a for the mitochondrial genome heavy strand, separated by pyrimidine (top) and purine (bottom) contexts. d, Spectrum of mitochondrial ssDNA calls combined from the liver and kidney samples shown in Extended Data Fig. 11a plus all bulk (i.e., non-mitochondria enriched) liver and kidney samples profiled by HiDEF-seq with A-tailing, since the ssDNA T>A artifact that A-tailing can incur in these post-mortem tissues (Supplementary Note 1) is orthogonal to the contexts of mitochondrial mutagenesis. Spectra are corrected for trinucleotide context opportunities, separately for each strand. Excluding bulk samples profiled by HiDEF-seq with A-tailing yields a similar spectrum (Extended Data Fig. 11c). a, Error bars: 95% confidence intervals from regressions. b, Bars and error bars: point estimates and their Poisson 95% confidence intervals.

See this image and copyright information in PMC

Update of

Single-strand mismatch and damage patterns revealed by single-molecule DNA sequencing.
Liu MH, Costa B, Choi U, Bandler RC, Lassen E, Grońska-Pęski M, Schwing A, Murphy ZR, Rosenkjær D, Picciotto S, Bianchi V, Stengs L, Edwards M, Loh CA, Truong TK, Brand RE, Pastinen T, Wagner JR, Skytte AB, Tabori U, Shoag JE, Evrony GD. Liu MH, et al. bioRxiv [Preprint]. 2023 Feb 19:2023.02.19.526140. doi: 10.1101/2023.02.19.526140. bioRxiv. 2023. Update in: Nature. 2024 Jun;630(8017):752-761. doi: 10.1038/s41586-024-07532-8. PMID: 36824744 Free PMC article. Updated. Preprint.

References

1. Mustjoki S & Young NS Somatic Mutations in “Benign” Disease. New England Journal of Medicine 384, 2039–2052 (2021). - PubMed
1. Vijg J & Dong X Pathogenic Mechanisms of Somatic Mutation and Genome Mosaicism in Aging. Cell 182, 12–23 (2020). - PMC - PubMed
1. Seplyarskiy VB & Sunyaev S The origin of human mutation in light of genomic data. Nature Reviews Genetics 22, 672–686 (2021). - PubMed
1. Koh G, Degasperi A, Zou X, Momen S & Nik-Zainal S Mutational signatures: emerging concepts, caveats and clinical applications. Nature Reviews Cancer 21, 619–637 (2021). - PubMed
1. Evrony GD et al. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 151, 483–496 (2012). - PMC - PubMed

Extended Data Figures and Methods References

1. Agarwal A, Gupta S & Sharma R in Andrological Evaluation of Male Infertility: A Laboratory Guide (eds Ashok Agarwal, Sajal Gupta, & Rakesh Sharma) 101–107 (Springer International Publishing, 2016).
1. Buisson R et al. Passenger hotspot mutations in cancer driven by APOBEC3A and mesoscale genomic features. Science 364, eaaw2872 (2019). - PMC - PubMed
1. Wu H, de Gannes MK, Luchetti G & Pilsner JR Rapid method for the isolation of mammalian sperm DNA. BioTechniques 58, 293–300 (2015). - PMC - PubMed
1. Jenkins TG, Liu L, Aston KI & Carrell DT Pre-screening method for somatic cell contamination in human sperm epigenetic studies. Systems Biology in Reproductive Medicine 64, 146–155 (2018). - PubMed
1. Nurk S et al. The complete sequence of a human genome. Science 376, 44–53 (2022). - PMC - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
- Nature Publishing Group
- PubMed Central
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

DNA mismatch and damage patterns revealed by single-molecule sequencing

Affiliations

DNA mismatch and damage patterns revealed by single-molecule sequencing

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

Update of

References

Extended Data Figures and Methods References

MeSH terms

Substances

Grants and funding

LinkOut - more resources

Full Text Sources

Research Materials