[Preprint]. 2024 Mar 10:2024.03.05.583511.
doi: 10.1101/2024.03.05.583511.

Uncalled4 improves nanopore DNA and RNA modification detection via fast and accurate signal alignment

Sam Kovaka et al. bioRxiv.

Abstract

Nanopore signal analysis enables detection of nucleotide modifications from native DNA and RNA sequencing, providing both accurate genetic/transcriptomic and epigenetic information without additional library preparation. Presently, only a limited set of modifications can be directly basecalled (e.g. 5-methylcytosine), while most others require exploratory methods that often begin with alignment of nanopore signal to a nucleotide reference. We present Uncalled4, a toolkit for nanopore signal alignment, analysis, and visualization. Uncalled4 features an efficient banded signal alignment algorithm, BAM signal alignment file format, statistics for comparing signal alignment methods, and a reproducible de novo training method for k-mer-based pore models, revealing potential errors in ONT's state-of-the-art DNA model. We apply Uncalled4 to RNA 6-methyladenine (m6A) detection in seven human cell lines, identifying 26% more modifications than Nanopolish using m6Anet, including in several genes where m6A has known implications in cancer. Uncalled4 is available open-source at github.com/skovaka/uncalled4.
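As a rough illustration of what banded signal alignment means, the sketch below aligns a list of normalized current samples to the expected current levels of consecutive reference k-mers using dynamic time warping restricted to a fixed-width band around the diagonal. This is a simplified stand-in and not Uncalled4's actual algorithm: the function name `banded_dtw`, the fixed band, the absolute-difference cost, and the two allowed moves (stay on the same reference position or advance by one) are all assumptions made for this example.

```python
import math

def banded_dtw(signal, expected, band=2):
    """Align signal samples to expected reference levels inside a diagonal band.

    Hypothetical helper for illustration only. Returns (total_cost, path),
    where path is a list of (sample_index, reference_index) pairs.
    """
    n, m = len(signal), len(expected)
    cost = [[math.inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        # Centre of the band for sample i, scaled onto the reference axis.
        centre = i * (m - 1) // max(n - 1, 1)
        for j in range(max(0, centre - band), min(m, centre + band + 1)):
            d = abs(signal[i] - expected[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best, prev = math.inf, None
            # Stay: the next sample maps to the same reference position.
            if i > 0 and cost[i - 1][j] < best:
                best, prev = cost[i - 1][j], (i - 1, j)
            # Step: the next sample maps to the next reference position.
            if i > 0 and j > 0 and cost[i - 1][j - 1] < best:
                best, prev = cost[i - 1][j - 1], (i - 1, j - 1)
            if prev is not None:
                cost[i][j] = best + d
                back[i][j] = prev
    if math.isinf(cost[n - 1][m - 1]):
        raise ValueError("no alignment fits inside the band; widen it")
    # Trace back from the final cell to recover the alignment path.
    path, cell = [], (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return cost[n - 1][m - 1], path
```

Only cells inside the band are filled, so the work grows with the number of samples times the band width rather than with the full samples-by-reference matrix, which is the general reason banding makes signal alignment fast.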

Figures

Figure 1. Pore model and alignment methods overview.
(a) Schematics of nanopore sequencing chemistries and their pore k-mer substitution profiles. Heatmaps show the mean normalized current difference observed by substituting each base (y-axis) at each k-mer position (x-axis), averaged over all k-mers in the model. (b) A signal-to-reference dotplot of an Escherichia coli 16S ribosomal RNA (rRNA) read sequenced using ONT r9.4 direct RNA sequencing. Top panel shows the raw samples (black), each plotted over the reference base it was aligned to, with the expected pore model current in white. Main panel shows the Uncalled4 read alignment (purple line) over the projected basecaller metadata alignment (orange dots). Side panels show per-reference coordinate summary statistics for the alignment. (c) A comparative signal-to-reference dotplot and distance metrics between the alignments. (d) A trackplot displaying heatmaps of many native (top) and in vitro transcribed (IVT, bottom) E. coli 16S rRNA reads aligned by Uncalled4, colored by the difference between the observed and expected normalized current level. Top bar is colored by reference base, and an O6-methylguanine site is known to occur at position 526. (e) A refplot summarizing the distributions of differences between observed and expected normalized current levels for native (purple) and IVT (green) reads. (f) Schematic of Uncalled4 inputs, outputs, and subcommands (see Methods).
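One possible reading of the substitution profile in panel (a) can be sketched as follows. For every k-mer position and base, each k-mer in the model has that base substituted in, and the signed change in expected current is averaged over all k-mers. The function name `substitution_profile`, the dictionary-based model, the use of a signed rather than absolute difference, and the toy 2-mer model are assumptions for illustration; the caption does not specify these details.

```python
from itertools import product

BASES = "ACGT"

def substitution_profile(model):
    """Mean current change from placing each base at each k-mer position.

    Hypothetical helper: `model` maps every possible k-mer (all 4**k of them)
    to its normalized expected current level.
    Returns {(position, base): mean signed difference}.
    """
    k = len(next(iter(model)))
    profile = {}
    for pos in range(k):
        for base in BASES:
            diffs = []
            for kmer, level in model.items():
                substituted = kmer[:pos] + base + kmer[pos + 1:]
                diffs.append(model[substituted] - level)
            profile[(pos, base)] = sum(diffs) / len(diffs)
    return profile

# Toy 2-mer model in which only the first base affects the current level.
toy_model = {a + b: float(BASES.index(a)) for a, b in product(BASES, repeat=2)}
toy_profile = substitution_profile(toy_model)
```

In the toy model every entry for position 1 is zero, because the second base has no effect on the level, while position 0 carries all of the signal. A heatmap of such a table would show which k-mer positions dominate the current for a given pore chemistry.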
Figure 2. Current distribution and nucleotide composition of k-mers in Uncalled4 trained pore models.
Plots represent (a) r9.4.1 DNA, (b) r10.4.1 DNA, and (c) r9.4.1 RNA (RNA002). ONT pore models are highly similar and produce nearly identical figures. (d) Mean and standard deviation of current surrounding a 9-mer adenine homopolymer in the D. melanogaster genome, based on Uncalled4 alignments of r9.4.1 and r10.4.1 DNA reads. (e) Fraction of basecalled reads containing a deletion within homopolymers of length nine or longer in the D. melanogaster X chromosome, computed using samtools mpileup.
Figure 3. 5mCpG signal characteristics.
(a) Normalized current levels for Uncalled4 5-methylcytosine (x-axis) and unmodified control (y-axis) r10.4.1 pore models, reduced to 4-mers by averaging k-mers sharing their last four bases. Each point is colored by the identity of the central base, with diamonds representing CpG-containing k-mers. Outlined diamonds indicate k-mers with the modified cytosine at the central position (C[G]) or one base upstream ([C]G). (b) Current-level KS statistic mean and interquartile ranges surrounding 5mCpG sites in the D. melanogaster X chromosome, computed from Uncalled4 and f5c r10.4.1 signal alignments using the ONT r10.4.1 400bps model.
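The two computations named in this caption can be sketched in a few lines. The first function collapses a k-mer model to 4-mers by averaging all k-mers that share their last four bases, as described for panel (a). The second computes a two-sample Kolmogorov-Smirnov (KS) statistic, the maximum gap between two empirical cumulative distributions, as a generic example of the statistic in panel (b). The function names, the dictionary-based model, and the toy values are assumptions for illustration, not Uncalled4's API or data.

```python
from bisect import bisect_right
from collections import defaultdict

def reduce_to_suffix(model, width=4):
    """Average the levels of all k-mers that share their last `width` bases."""
    sums, counts = defaultdict(float), defaultdict(int)
    for kmer, level in model.items():
        key = kmer[-width:]
        sums[key] += level
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def ks_statistic(a, b):
    """Two-sample KS statistic: largest distance between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

# Toy values only; real pore models contain every k-mer for the chemistry.
toy_model = {"AACGT": 1.0, "CACGT": 3.0, "GGGGG": -1.0}
reduced = reduce_to_suffix(toy_model)
```

A KS statistic near 0 means the modified and control current distributions at a position are nearly indistinguishable, while a value near 1 means they barely overlap.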
Figure 4. RNA modification detection.
(a) Gene-level comparative m6A detection in DRACH contexts. “Uncalled4 (spliced)” (magenta) is based on spliced genome alignments, while all others use transcriptome alignments averaged to the gene level. (b) Number of m6A sites found in each cell line which occur in the m6A-Atlas v2. Solid bars indicate the number of sites found with the default probability threshold of 0.9, and shaded bars indicate the count at the threshold where the precision is 85%. Uncalled4 with NA12878 has reduced recall at 85% precision, as indicated by the dashed line. Precision with default probability threshold of 0.9. (c) Coverage distribution of true positive (TP) sites (top) and precision of sites within coverage bins. (d) Number of sites shared by Uncalled4, Nanopolish, and m6A-Atlas v2 across all cell lines. (e) Difference in per-gene m6A count found by Uncalled4 and Nanopolish across all seven cell lines. (f) Difference in aggregated gene m6A count from Uncalled4 and Nanopolish alignments in COSMIC tier 1 genes with m6A modification found in every cell line (51 genes). (g) Transcript-level m6A calls in an ABL1 transcript alongside BCR fusion, and (h) gene-level m6A calls in the TTC4 gene.
