Nat Methods. 2025 Apr;22(4):681-691.
doi: 10.1038/s41592-025-02631-4. Epub 2025 Mar 28.

Uncalled4 improves nanopore DNA and RNA modification detection via fast and accurate signal alignment

Sam Kovaka et al. Nat Methods. 2025 Apr.

Abstract

Nanopore signal analysis enables detection of nucleotide modifications from native DNA and RNA sequencing, providing both accurate genetic or transcriptomic and epigenetic information without additional library preparation. At present, only a limited set of modifications can be directly basecalled (for example, 5-methylcytosine), while most others require exploratory methods that often begin with alignment of nanopore signal to a nucleotide reference. We present Uncalled4, a toolkit for nanopore signal alignment, analysis and visualization. Uncalled4 features an efficient banded signal alignment algorithm, a BAM signal alignment file format, statistics for comparing signal alignment methods and a reproducible de novo training method for k-mer-based pore models, revealing potential errors in Oxford Nanopore Technologies' state-of-the-art DNA model. We apply Uncalled4 to RNA 6-methyladenine (m6A) detection in seven human cell lines, identifying 26% more modifications than Nanopolish using m6Anet, including in several genes where m6A has known implications in cancer. Uncalled4 is available open source at github.com/skovaka/uncalled4.

Conflict of interest statement

Competing interests: W.T. has two patents (8,748,091 and 8,394,584) licensed to ONT. S.K. has received travel funding from ONT. The other authors declare no competing interests.

Figures

Fig. 1. Pore model and alignment methods overview.
a, Schematics of ONT sequencing chemistries, their pore k-mer model current distributions, and nucleotide compositions of k-mers within current ranges indicated by dashed lines. b, A signal-to-reference dotplot of an Escherichia coli 16S rRNA read sequenced using ONT r9.4 direct RNA sequencing. Top panel shows the raw samples (black) plotted over the reference base each was aligned to, with the expected pore model current in white. Main panel shows the Uncalled4 read alignment (purple line) over the projected basecaller metadata alignment (orange dots). Side panels show per-reference coordinate summary statistics for the alignment. c, Schematic of Uncalled4 inputs, outputs and subcommands (Methods). d, A trackplot displaying heatmaps of many native (bottom) and IVT (top) E. coli 16S rRNA reads aligned by Uncalled4, colored by the difference between the observed and expected normalized current level. Top bar is colored by reference base, and an O6-methylguanine site is known to occur at position 526. e, A refplot summarizing the distributions of differences between observed and expected normalized current levels for native (purple) and IVT (green) reads. f, A comparative signal-to-reference dotplot alongside distance (dist.) metrics between Uncalled4 and Nanopolish alignments of the same read, where line breaks in the distance plots correspond to regions masked by Nanopolish. norm., normalized.
Fig. 2. Current distribution and nucleotide composition of k-mers in Uncalled4-trained pore models.
a, Pore model k-mer substitution profile heatmaps: the mean normalized current difference observed by substituting each base (y axis) at each k-mer position (x axis) averaged over all k-mers in the model. b, Mean and standard deviation of current surrounding a 9-mer adenine homopolymer in the D. melanogaster genome, based on Uncalled4 alignments of r9.4.1 and r10.4.1 DNA reads. Solid lines show the mean normalized current of all reads aligned at each position, and the shaded region shows the standard deviation of the mean current levels. c, Fraction of basecalled reads containing a deletion within homopolymers of length nine or longer in the D. melanogaster X chromosome, computed using samtools mpileup. d, Distributions of per-read k-mer dwell times output by Uncalled4 alignments of 4,000 randomly sampled reads from D. melanogaster DNA and human HEK293T RNA. Boxes span the first and third quartiles with the median indicated by the horizontal line, and whiskers extend to 1.5 times the interquartile range. e, Standard deviation of per-k-mer dwell times relative to central pore position, where each offset along the x axis indicates the standard deviation of median dwell times for each 5-mer at that position.
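The substitution profile described in panel a can be sketched in a few lines. This is an illustration under stated assumptions, not Uncalled4's documented procedure: the pore model is taken to be a plain mapping from k-mer to mean normalized current, and the "difference" for a base is measured against the mean of the four k-mers that differ only at that position. The function names and the toy 2-mer model are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def substitution_profile(model, k):
    """Return profile[pos][base_index]: mean current shift caused by placing
    each base at each k-mer position, averaged over all flanking contexts.
    Assumption: shift is measured against the mean of the four variants."""
    profile = [[0.0] * 4 for _ in range(k)]
    n_contexts = 4 ** (k - 1)
    for pos in range(k):
        for ctx in product(BASES, repeat=k - 1):
            left, right = "".join(ctx[:pos]), "".join(ctx[pos:])
            levels = [model[left + b + right] for b in BASES]
            mean = sum(levels) / 4
            for bi, level in enumerate(levels):
                profile[pos][bi] += (level - mean) / n_contexts
    return profile

def most_influential_position(profile):
    """Position whose substitutions move the current the most on average."""
    return max(range(len(profile)),
               key=lambda p: sum(abs(x) for x in profile[p]))

# Toy 2-mer model (hypothetical): current depends only on the first base.
toy_model = {a + b: float(BASES.index(a)) for a in BASES for b in BASES}
toy_profile = substitution_profile(toy_model, k=2)
```

In the toy model only the first position carries signal, so `most_influential_position(toy_profile)` picks position 0; a criterion of this kind is one plausible way to choose a "central" base from a substitution matrix.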
Fig. 3. 5mCpG signal characteristics.
a, Normalized current levels for Uncalled4 5mC (x axis) and unmodified control (y axis) r10.4.1 pore models, reduced to 4-mers by averaging k-mers sharing their last four bases. Each point is colored by the identity of the central base, with diamonds representing CpG-containing k-mers. Outlined diamonds indicate k-mers with the modified cytosine at the central position (C[G]) or one base upstream ([C]G). b, Current-level z-scores and KS statistics of differences in current between control and 5mCpG datasets for each position surrounding CpG sites in the D. melanogaster X chromosome, computed from Uncalled4 and f5c r10.4.1 signal alignments using the ONT r10.4.1 400-bps model. KS statistic lines show the median normalized current difference between the signal aligned at each position and the pore model, and shaded regions show the interquartile range of the current difference. c, Normalized current levels for Uncalled4 BrdU and unmodified datasets trained on r9.5 data, with each point colored by the identity of the central position (second base) of each 5-mer and diamonds indicating k-mers containing a ‘T’. d, ROC curve of classification of BrdU versus unmodified k-mers based on Uncalled4 alignments using the Uncalled4-trained or DNAScent2 pore models compared to a control model.
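For readers unfamiliar with the KS statistic used in panel b, the following minimal sketch computes the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distributions of two samples. The current values are made up for illustration, and the function name is hypothetical; it is not taken from Uncalled4.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both ECDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Made-up normalized current levels at one reference position.
control = [0.0, 0.1, 0.2, 0.3]
modified = [1.0, 1.1, 1.2, 1.3]
shifted = ks_statistic(control, modified)   # fully separated samples
same = ks_statistic(control, control)       # identical samples
```

A value near 1 means the two current distributions barely overlap at that position, while a value near 0 means they are indistinguishable.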
Fig. 4. HEK293T RNA m6A detection.
a, AUPRC for comparative RNA m6A modification detection using Uncalled4, Nanopolish and Tombo alignments as input to KS statistics, xPore or m6Anet. ‘Uncalled4 (spliced)’ (magenta) is based on spliced genome alignments, while all others use transcriptome alignments averaged to the gene level. b, Precision-recall curves comparing m6Anet performance on a single HEK293T sample using Uncalled4 or Nanopolish alignments. Solid lines include all sites that are covered by basecalled alignments by at least 20 reads, where sites not output by either tool are assigned a score of zero, generating a large discontinuity for Nanopolish due to pervasive masking. The dashed curves include all sites output by either aligner (union), while dotted curves only include sites output by both aligners (intersection). c, True positive rate and precision binned by basecalled read coverage. m6Anet probability threshold was selected such that the overall precision for each aligner equals 90% (dashed horizontal line in b).
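The effect of assigning a score of zero to sites a tool does not output (panel b, solid lines) can be seen in a small sketch. The site names, labels and scores below are invented, and `precision_recall` is a hypothetical helper, not part of m6Anet or Uncalled4.

```python
def precision_recall(truth, calls, threshold):
    """truth: site -> bool (modified or not); calls: site -> score.
    Sites missing from `calls` get score 0, so with any threshold above 0
    they are never predicted positive and can only lower recall."""
    tp = fp = fn = 0
    for site, is_modified in truth.items():
        score = calls.get(site, 0.0)
        predicted = score >= threshold
        if predicted and is_modified:
            tp += 1
        elif predicted:
            fp += 1
        elif is_modified:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented example: tool B masks three of the five covered sites.
truth = {"s1": True, "s2": True, "s3": False, "s4": True, "s5": False}
tool_a = {"s1": 0.95, "s2": 0.8, "s3": 0.6, "s4": 0.4, "s5": 0.1}
tool_b = {"s1": 0.95, "s3": 0.6}
pr_a = precision_recall(truth, tool_a, threshold=0.5)
pr_b = precision_recall(truth, tool_b, threshold=0.5)
```

Because masked sites can never be recovered at any positive threshold, a tool that masks heavily has its recall capped, which is the kind of discontinuity the caption attributes to pervasive masking.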
Fig. 5. RNA modification detection across seven human cell lines.
a, Number of m6A sites found in each cell line that occur in the m6A-Atlas v.2 pTPs. Solid bars indicate the number of sites found with the default probability threshold 0.9, and shaded bars indicate the count at the threshold where the putative positive predictive value (pPPV) is 85%. Uncalled4 with NA12878 has reduced recall at 85% pPPV, as indicated by the dashed line. b, Coverage distribution of true positive (pTP) sites (top) and pPPV of sites within coverage bins. c, Number of sites shared by Uncalled4, Nanopolish and m6A-Atlas v.2 across all cell lines. d, Difference in per-gene m6A count found by Uncalled4 and Nanopolish across all seven cell lines. e, Difference in aggregated gene m6A count found by Uncalled4 versus Nanopolish alignments, limited to COSMIC tier 1 genes where at least one m6A modification is found in every cell line by either tool (51 genes). Negative (green) values indicate genes where more m6A sites were found by Nanopolish, and positive (purple) values indicate more m6A sites found by Uncalled4. f, Transcript-level m6A calls in an ABL1 transcript alongside BCR fusion. g, Gene-level m6A calls in the TTC4 gene.
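Selecting a probability threshold that reaches a target pPPV, as in panel a, can be sketched as follows. This is an assumption about how such a threshold might be chosen (the most permissive score cutoff whose calls still meet the target), not the authors' stated procedure; the scores and labels are invented and the function name is hypothetical.

```python
def threshold_for_precision(labels, scores, target):
    """Return (threshold, n_calls, n_true) for the lowest score cutoff at which
    the fraction of called sites that are labeled true is >= target,
    or None if no cutoff reaches the target."""
    pairs = sorted(zip(scores, labels), reverse=True)
    best = None
    true_calls = 0
    for n_calls, (score, label) in enumerate(pairs, start=1):
        true_calls += label
        # Only evaluate once every site tied at this score has been counted.
        last_of_tie = n_calls == len(pairs) or pairs[n_calls][0] != score
        if last_of_tie and true_calls / n_calls >= target:
            best = (score, n_calls, true_calls)
    return best

# Invented scores; label 1 means the site appears in the reference atlas.
scores = [0.99, 0.95, 0.9, 0.8, 0.7, 0.6]
labels = [1, 1, 1, 0, 1, 0]
chosen = threshold_for_precision(labels, scores, target=0.8)
```

Precision is not monotonic in the cutoff, so the sketch scans every distinct score rather than stopping at the first failure.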
Extended Data Fig. 1. Comparisons between RNA pore model per-k-mer current means.
(a) Comparison between the five central bases of ONT’s 9-mer RNA004 model and an Uncalled4-trained RNA002 5-mer model. (b) Uncalled4-trained RNA002 model compared with an ONT ‘rna_r9.4_180mv_70bps’ model, which is the default model that Uncalled4 and Nanopolish use for RNA001 or RNA002. (c) Boxplots showing distribution of differences between the mean current of signal aligned to the HEK293T reference and the current predicted by the k-mer model. Boxes span the first and third quartiles with the median indicated by the horizontal line, and whiskers extend to 1.5 times the interquartile range.
Extended Data Fig. 2. Illustration of basecaller-guided DTW.
(a) Generation of ref-moves from raw basecaller moves and a minimap2 alignment. The minimap2 ‘CIGAR’ corresponding to the basecalled read alignment is ‘9M1I6M1D3M’. K-mer coordinates are defined relative to the central base, which is defined for each pore model based on its substitution matrix (Fig. 2a). (b) A standard NxM DTW matrix, where N = M = 5. Cells are colored by their Manhattan distance from (1,1), which corresponds to the band that will contain them. The red line represents the ref-moves which will guide band placement. (c) The same DTW matrix overlaid with bands centered on the ref-moves (band width W = 3). (d) The DTW band matrix with each row offset by its location in the NxM matrix, which is shaded in the background and rotated 45°. White cells indicate out-of-bounds coordinates. Band start coordinates are indicated by the colored numbers to the left. (e) The DTW band matrix, represented as a standard two-dimensional array. Note that the start coordinates are required to reconstruct the original matrix structure.
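The idea of restricting DTW to a band around a guide path can be sketched as below. This is a deliberately simplified illustration, not Uncalled4's implementation: the bands here are row-wise windows around a per-sample guide column rather than the anti-diagonal bands shown in the figure, the cost is a plain absolute difference, and `centers` is a hypothetical stand-in for the ref-moves.

```python
import math

def banded_dtw(signal, model, centers, width=3):
    """Align signal samples (rows) to expected k-mer currents (columns).
    Only cells within width // 2 columns of centers[i] are evaluated.
    Returns (total_cost, path) with path as a list of (row, col) cells."""
    n, m = len(signal), len(model)
    half = width // 2
    cost = [dict() for _ in range(n)]   # sparse rows: col -> cumulative cost
    back = [dict() for _ in range(n)]   # col -> predecessor cell
    for i in range(n):
        lo = max(0, centers[i] - half)
        hi = min(m - 1, centers[i] + half)
        for j in range(lo, hi + 1):
            d = abs(signal[i] - model[j])
            if i == 0 and j == 0:
                cost[0][0], back[0][0] = d, None
                continue
            best, prev = math.inf, None
            # Standard DTW moves, skipping predecessors outside the band.
            for pi, pj in ((i - 1, j), (i - 1, j - 1), (i, j - 1)):
                if pi >= 0 and pj >= 0 and pj in cost[pi]:
                    if cost[pi][pj] < best:
                        best, prev = cost[pi][pj], (pi, pj)
            if prev is not None:
                cost[i][j], back[i][j] = best + d, prev
    if (m - 1) not in cost[n - 1]:
        raise ValueError("band does not reach the final cell")
    path, cell = [], (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return cost[n - 1][m - 1], path

# Made-up example: five samples, three k-mer levels, guide from a basecaller.
signal = [1.0, 1.1, 3.0, 2.9, 5.0]
model = [1.0, 3.0, 5.0]
total, path = banded_dtw(signal, model, centers=[0, 0, 1, 1, 2], width=3)
```

Storing each row as a sparse band keeps the work proportional to N x W rather than N x M, which is the motivation for banding; as the caption notes for the real layout, the band start coordinates are what allow the compact array to be mapped back to the full matrix.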
Extended Data Fig. 3. Signal-to-read and signal-to-reference alignment.
(a) Per-k-mer current means from signal-to-read and signal-to-reference Uncalled4 alignments and uncorrected basecaller moves. (b) Alignment dotplot of a D. melanogaster r10.4.1 read to a reference containing a spiked-in G>T substitution at the location indicated by the red lines, causing increased read-model current MAD. (c) Alignment dotplot of the same read to a reference with a 10-nucleotide deletion with boundaries indicated by red lines. Uncalled4 masks signal around insertions or deletions 10 nucleotides or larger based on the ref-moves coordinates, meaning the signal corresponding to the deleted sequence is not included. (d) Reference- and read-aligned signal of a read which features a likely sequencing error causing a two-nucleotide insertion in the basecalled sequence. A slight jump in signal is observed within the signal mapping to ‘A’ in the signal-to-reference alignment, indicated by the green arrow. This nucleotide is broken into ‘GAG’ in the basecalled sequence, making the signal-to-read alignment erroneously more similar to the pore model.
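Locating the insertions and deletions that would trigger masking (panel c) can be sketched from a CIGAR string alone. The 10-nucleotide cutoff comes from the caption; everything else, including the function name and the example CIGAR, is hypothetical, and how Uncalled4 actually applies the mask is not described here.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def large_indels(cigar, ref_start=0, min_len=10):
    """Return (op, ref_pos, length) for each insertion or deletion of at
    least min_len bases, where ref_pos is the reference coordinate at which
    the event begins."""
    events = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID" and length >= min_len:
            events.append((op, ref_pos, length))
        if op in "MDN=X":   # operations that consume reference bases
            ref_pos += length
    return events

# Invented alignment with one large deletion and one large insertion.
events = large_indels("20M10D5M2I30M12I8M", ref_start=100)
```

Short events such as the 2-base insertion above fall below the cutoff and are left for the signal aligner to absorb.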
Extended Data Fig. 4. Modification signal trackplots.
Trackplots displaying per-read-k-mer mean current levels for Uncalled4 and Nanopolish in-vitro transcribed and native E. coli ribosomal RNA sequenced with ONT RNA002. An O6-methylguanine site is present in the native dataset at position 526, causing a drop in current. White cells indicate masked positions, where Uncalled4 performs no masking in this dataset because there were no large insertions or deletions, while Nanopolish masks many positions particularly around the modification site.
Extended Data Fig. 5. DNA model training results.
(a) Current levels from Uncalled4 and ONT r9.4.1 6-mer DNA models. (b) Current levels from Uncalled4 and ONT r10.4.1 400 bps 9-mer DNA models. Inset displays sequence logo for k-mers with more than 0.5 normalized units of difference between the models (indicated on main plot by dashed line). (c) Current distributions for k-mers with each base fixed at the 6th and 5th positions for r10.4.1 models, including both 400 bps and 260 bps ONT models. Most distributions are unimodal, except for ONT 400 bps which has outliers caused by ‘TVTT’ k-mers. (d) Comparison between Uncalled4’s r10.4.1 400 bps model and ONT’s 260 bps model, which lacks the outliers seen in ONT’s 400 bps model.
Extended Data Fig. 6. Transcript-level comparative m6A detection.
Precision-recall and ROC curves for transcript-level comparative m6A detection in HEK293T in all contexts (a-b) and limited to DRACH sites (c-d).
Extended Data Fig. 7. Gene-level modification detection methods.
(a) Illustration of transcript-level modification calling, genome-level calling, and translation of transcript-level calls to the gene-level (t2g). (b) Precision-recall and (c) ROC curves of Uncalled4 and Nanopolish gene-level calls using KS statistics. ‘Splice’ indicates Uncalled4 spliced genome alignment. ‘Multi t2g’ indicates transcript-to-gene averaging using all multi-mapping reads, while ‘pri t2g’ indicates the same but only using primary alignments.
Extended Data Fig. 8. Gene-level comparative m6A detection.
Precision-recall and ROC curves for gene-level comparative m6A detection in HEK293T in all contexts (a-b) and limited to DRACH sites (c-d). ‘Splice’ indicates Uncalled4 spliced genome alignment. All other methods used transcriptome alignments with all multi-mappers included, averaged to the gene level.
Extended Data Fig. 9. Gene-level HEK293T m6Anet.
Gene-level HEK293T m6Anet calls via transcript-to-genome (t2g) averaging. (a) Precision-recall and (b) ROC curves using GLORI labels with no level threshold. (c) Areas under the precision-recall and (d) ROC curves using different thresholds on GLORI levels.
Extended Data Fig. 10. Cell line m6Anet analysis.
(a) Distance from annotated stop codon for transcript-level m6A sites found by Uncalled4 (purple) and Nanopolish (green) with m6Anet at matched 85% precision. (b) Metagene plot of the same m6A sites, with the distribution of reference DRACH sites (gray) and DRACH sites covered by the nanopore reads (orange). (c) Gene-level m6A counts by the number of cell lines they were found in, divided into putative ‘true positives’ (TP, in the m6A-Atlas) and putative ‘false positives’ (FP, missing from the m6A-Atlas).
