Commun Biol. 2025 Apr 14;8(1):606.
doi: 10.1038/s42003-025-08009-8.

Transformer-based deep learning for accurate detection of multiple base modifications using single molecule real-time sequencing

Xi Hu et al. Commun Biol. 2025.

Abstract

We had previously reported a convolutional neural network (CNN) based approach, called the holistic kinetic model (HK model 1), for detecting 5-methylcytosine (5mC) by single molecule real-time sequencing (Pacific Biosciences). In this study, we constructed a hybrid model with CNN and transformer layers, named HK model 2. We improve the area under the receiver operating characteristic curve (AUC) for 5mC detection from 0.91 for HK model 1 to 0.99 for HK model 2. We further demonstrate that HK model 2 can detect other types of base modifications, such as 5-hydroxymethylcytosine (5hmC) and N6-methyladenine (6mA). Using HK model 2 to analyze 5mC patterns of cell-free DNA (cfDNA) molecules, we demonstrate the enhanced detection of patients with hepatocellular carcinoma, with an AUC of 0.97. Moreover, HK model 2-based detection of 6mA enables the detection of jagged ends of cfDNA and the delineation of cellular chromatin structures. HK model 2 is thus a versatile tool expanding the applications of single molecule real-time sequencing in liquid biopsies.

Conflict of interest statement

Competing interests: K.C.A.C. and Y.M.D.L. hold equities in DRA, Take2, and Insighta. K.C.A.C. is a Director of DRA, Take2, and Insighta. P.J. holds equities in Illumina. P.J. is a Director of DRA, KingMed Future, and Take2. W.K.J.L. is a director of DRA. X.H., P.J., K.C.A.C., and Y.M.D.L. have filed a patent application based on this work, which has recently been licensed to Pacific Biosciences.

Figures

Fig. 1. A schematic of the model structure of HK model 2.
Subreads generated from single-molecule real-time sequencing (SMRT-seq) are aligned to the corresponding circular consensus sequence (CCS), and kinetic features are established for individual nucleotides. These kinetic features include inter-pulse duration (IPD) and pulse width (PW) (top left). Because DNA is double-stranded, subreads can be derived from both the Watson and Crick strands. As SMRT-seq uses a circularized DNA template, the DNA polymerase (yellow) performs multiple laps of continuous, processive polymerization with fluorescently labeled nucleotides, namely A (adenine), C (cytosine), G (guanine), and T (thymine) (top right), producing multiple subreads from the same DNA template. The colors of the fluorescent pulses during sequencing determine the identity of each base. The trajectory of these fluorescent signals is used to measure the two key kinetic features: IPD reflects the time interval between two consecutive base incorporations, while PW indicates how long a base incorporation event lasts. Because SMRT sequencing measures the same molecule repeatedly, the collective use of subreads from that molecule can improve sequencing accuracy and the quantification of polymerase kinetics, which are influenced by base modifications present in the template [e.g. 5mC (5-methylcytosine), 5hmC (5-hydroxymethylcytosine), or 6mA (N6-methyladenine)].
The holistic kinetic (HK) model 2 framework is illustrated at the bottom. The kinetic signals of sequenced nucleotides within a flanking region around a query site (e.g. a C nucleotide in the CG context) are organized into an input matrix based on their base identities and positions, forming a measurement window. The input matrix is processed through convolutional layers, which extract local kinetic patterns associated with base modification. The output of these layers, combined with positional embeddings encoding relative nucleotide positions, is passed into transformer layers, which capture kinetic relationships across the measurement window. The output layer generates probabilities for different types of base modification (referred to as base modification scores). Base modifications predicted by the current HK model 2 include 5mC, 5hmC, and 6mA.
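The measurement window described in this caption can be illustrated with a minimal sketch. The layout below (one IPD row and one PW row per base, one column per position, zero padding beyond the molecule ends) is an assumption made for illustration only; the function name `build_measurement_window`, the `flank` parameter, and the row ordering are hypothetical and are not taken from the HK model 2 implementation.

```python
from typing import List, Sequence

BASES = "ACGT"  # row order is an illustrative assumption


def build_measurement_window(
    seq: str,
    ipd: Sequence[float],
    pw: Sequence[float],
    query: int,
    flank: int = 10,
) -> List[List[float]]:
    """Arrange per-nucleotide IPD and PW values around a query site.

    Returns a matrix with 8 rows (IPD for A, C, G, T, then PW for
    A, C, G, T) and 2 * flank + 1 columns. Each column is one position
    in the window; only the rows of that position's base are filled.
    """
    if not (len(seq) == len(ipd) == len(pw)):
        raise ValueError("seq, ipd and pw must have the same length")
    width = 2 * flank + 1
    matrix = [[0.0] * width for _ in range(2 * len(BASES))]
    for col, pos in enumerate(range(query - flank, query + flank + 1)):
        if pos < 0 or pos >= len(seq):
            continue  # outside the molecule: column stays zero (assumed padding)
        row = BASES.find(seq[pos])
        if row < 0:
            continue  # skip ambiguous bases such as N
        matrix[row][col] = float(ipd[pos])
        matrix[row + len(BASES)][col] = float(pw[pos])
    return matrix


# Toy example: query the C at index 1 with a 2-nt flank on each side.
window = build_measurement_window(
    "ACGTACGT",
    ipd=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    pw=[10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
    query=1,
    flank=2,
)
```

In a full model, a matrix like this would be the input to the convolutional layers; the transformer layers and positional embeddings are not sketched here.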
Fig. 2. The performance comparison of HK model 1 and 2.
A Receiver Operating Characteristic (ROC) curves for the testing dataset on the basis of different models. B Area under ROC curve (AUC-ROC) values across different subread depths of SMRT-seq. Error bars represent one standard deviation of AUC among five repeated measurements. C Precision-Recall (PR) curves for the testing dataset on the basis of different models. D Area under PR curve (AUC-PR) values across different subread depths of SMRT-seq. Error bars represent one standard deviation of AUC among five repeated measurements. E Percentage of callable CpG sites at relative positions of DNA molecules. The grey area indicates the no-call region of the HK model 1. F ROC curve of HK model 2 for analyzing the CpG sites within the 10-nt distance relative to the nearest 5’ end of sequenced DNA fragments.
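As an illustration of the metrics in this figure, the sketch below computes AUC-ROC from base modification scores with the rank-based (Mann-Whitney) formulation and summarizes repeated measurements as a mean with one standard deviation. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC as the probability that a modified site outscores an
    unmodified site, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative site")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize_repeats(aucs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across repeated measurements."""
    return mean(aucs), stdev(aucs)


# Toy scores: label 1 = modified site, label 0 = unmodified site.
example_auc = auc_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise loop is quadratic in the number of sites, which is fine for a toy example; a sort-based rank sum gives the same value for large datasets.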
Fig. 3. The evaluation of strand-specific HK model 2.
A ROC curves of strand-specific HK model 2 between Dataset 03 and Dataset 02. B AUC-ROC of methylation analysis for CpG sites at positions relative to the nearest end of sequenced fragments between Dataset 02 (by protocol A) and 03 (by protocol B).
Fig. 4. Schematic workflow for differentiating between 5mC and 5hmC modifications using M.SssI-treated and Ligation-based DNA.
A Illustration of the composition of TET-treated DNA. B Illustration of the preparation for the 5hmC detection dataset (named Lig-5hmC) based on a ligation method. C The analytical workflow for 5mC and 5hmC detection in SMRT-seq. D ROC curves of the testing datasets for the 5xC and 5hmC detection. E Box plots of modification scores for 5hmC detection in the testing dataset.
Fig. 5. Detection of 5mC and 5hmC modifications in biological samples including the human brain and buffy coat samples.
A Methylation levels measured by different approaches in buffy coat and brain samples across different genomic regions of interest. CGI: CpG island, LINE: long interspersed nuclear element, LTR: long terminal repeat. B Methylation levels predicted by HK model 2 in human brain samples around TSS sites. C Correlation of the 5xC levels measured by the HK model 2 and BS-seq. D Correlation of the 5hmC levels measured by the HK model 2 and TAB-seq.
Fig. 6. Evaluation of 6mA analysis based on HK model 2 trained through the use of whole-genome amplification with the presence of unmethylated or methylated adenines.
A Schematic for preparing the unmethylated and methylated adenine datasets (i.e. uA and 6mA datasets). B IPD distributions in uA and 6mA datasets. C ROC curves of 6mA detection based on HK model 2 and only the IPD metric. D False positive rates of 6mA detection based on HK model 2 and the IPD metric only. Error bars represent one standard deviation of false positive rates among five repeated measurements. E 6mA methylation levels determined by HK model 2 in non-GATC and GATC contexts in the Dam-treated DNA sample.
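The context comparison in panel E can be sketched as follows: each adenine with a modification score is assigned to the GATC or non-GATC context (Dam methyltransferase targets the adenine in GATC), and the methylation level of a context is taken as the fraction of its adenines scoring at or above a cutoff. The cutoff of 0.5, the single-strand treatment, and the function name are assumptions for illustration, not details reported in the source.

```python
from typing import Dict, Mapping, Optional


def methylation_levels_by_context(
    seq: str,
    scores: Mapping[int, float],
    cutoff: float = 0.5,  # hypothetical calling threshold
) -> Dict[str, Optional[float]]:
    """Fraction of scored adenines called 6mA in GATC vs non-GATC contexts.

    `scores` maps 0-based positions of adenines to 6mA modification
    scores; adenines without a score are ignored.
    """
    counts = {"GATC": [0, 0], "non-GATC": [0, 0]}  # [called 6mA, total]
    for i, base in enumerate(seq):
        if base != "A":
            continue
        score = scores.get(i)
        if score is None:
            continue
        in_gatc = i >= 1 and seq[i - 1:i + 3] == "GATC"
        context = "GATC" if in_gatc else "non-GATC"
        counts[context][1] += 1
        if score >= cutoff:
            counts[context][0] += 1
    return {
        ctx: (called / total if total else None)
        for ctx, (called, total) in counts.items()
    }


# Toy example with made-up scores for the five adenines in the sequence.
levels = methylation_levels_by_context(
    "GATCAAGATCA",
    {1: 0.9, 4: 0.1, 5: 0.2, 7: 0.8, 10: 0.7},
)
```

A context with no scored adenines returns `None` rather than 0, so that missing data is not mistaken for an unmethylated context.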
Fig. 7. Detection of 6mA in microbes.
A 6mA methylation levels determined by HK model 2. B de novo motif analysis related to 6mA modifications across various microbes.
Fig. 8. Potential applications of HK model 2.
A HCC methylation scores were determined by HK model 2 in healthy individuals (n = 15), HBV carriers (n = 13), and HCC patients (n = 13) using sequenced DNA molecules with 1 to 6 CpG sites. B ROC curves of using HCC methylation score for classifying individuals with and without HCC based on molecules with 1 to 6 CpG sites or at least 7 CpG sites. C The jaggedness profile of plasma DNA in a healthy individual. D Patterns of 6mA levels in genomic sites relative to CTCF binding sites.
