Enhanced detection of RNA modifications and read mapping with high-accuracy nanopore RNA basecalling models

Gregor Diensthuber^#^{1

2}, Leszek P Pryszcz^#¹, Laia Llovera¹, Morghan C Lucas^{1

2}, Anna Delgado-Tejedor^{1

2}, Sonia Cruciani^{1

2}, Jean-Yves Roignant^{3

4}, Oguzhan Begik⁵, Eva Maria Novoa^{5

2}

Affiliations

¹ Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain.
² Universitat Pompeu Fabra, Barcelona 08003, Spain.
³ Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, 1015 Lausanne, Switzerland.
⁴ Institute of Pharmaceutical and Biomedical Sciences, Johannes Gutenberg-University Mainz, 55128 Mainz, Germany.
⁵ Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain; oguzhan.begik@crg.eu eva.novoa@crg.eu.

^# Contributed equally.

PMID: 39271295
PMCID: PMC11610583
DOI: 10.1101/gr.278849.123

Enhanced detection of RNA modifications and read mapping with high-accuracy nanopore RNA basecalling models

Gregor Diensthuber et al. Genome Res. 2024.

. 2024 Nov 20;34(11):1865-1877.

doi: 10.1101/gr.278849.123.

Authors

Affiliations

¹ Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain.
² Universitat Pompeu Fabra, Barcelona 08003, Spain.
³ Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, 1015 Lausanne, Switzerland.
⁴ Institute of Pharmaceutical and Biomedical Sciences, Johannes Gutenberg-University Mainz, 55128 Mainz, Germany.
⁵ Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain; oguzhan.begik@crg.eu eva.novoa@crg.eu.

^# Contributed equally.

PMID: 39271295
PMCID: PMC11610583
DOI: 10.1101/gr.278849.123

Abstract

In recent years, nanopore direct RNA sequencing (DRS) became a valuable tool for studying the epitranscriptome, owing to its ability to detect multiple modifications within the same full-length native RNA molecules. Although RNA modifications can be identified in the form of systematic basecalling "errors" in DRS data sets, N6-methyladenosine (m⁶A) modifications produce relatively low "errors" compared with other RNA modifications, limiting the applicability of this approach to m⁶A sites that are modified at high stoichiometries. Here, we demonstrate that the use of alternative RNA basecalling models, trained with fully unmodified sequences, increases the "error" signal of m⁶A, leading to enhanced detection and improved sensitivity even at low stoichiometries. Moreover, we find that high-accuracy alternative RNA basecalling models can show up to 97% median basecalling accuracy, outperforming currently available RNA basecalling models, which show 91% median basecalling accuracy. Notably, the use of high-accuracy basecalling models is accompanied by a significant increase in the number of mapped reads-especially in shorter RNA fractions-and increased basecalling error signatures at pseudouridine (Ψ)- and N1-methylpseudouridine (m¹Ψ)-modified sites. Overall, our work demonstrates that alternative RNA basecalling models can be used to improve the detection of RNA modifications, read mappability, and basecalling accuracy in nanopore DRS data sets.

PubMed Disclaimer

Figures

**Figure 1.**
Benchmarking RNA basecalling models and their effect on basecalling accuracy and RNA modification detection. (A) Schematic overview of the in vitro “curlcakes” used in this work, depicting the seven different modification types that were sequenced (m⁶A, m⁵C, hm⁵C, ac⁴C, Ψ, m¹Ψ, and m⁵U) and the three RNA basecalling models benchmarked (default, IVT, and SUP). To assess the performance of tested RNA basecalling models, error signatures (mismatch, deletion, and insertion frequency) were used. (B) Comparison of basecalling accuracies obtained from each RNA basecalling model on human transcriptomic data (see Methods). The curve represents the density of the per-read identities for each model. Solid lines represent Guppy, and dashed lines represent Dorado as basecalling software. We would like to note that a subset of reads from the HEK293T samples used here was also used for training our “SUP” model (see Methods); however, see also Figure 4A, which shows similar accuracy for SUP model in other species/data sets not included in model training. (C) IGV snapshots showing the effect of using either default, IVT, or SUP models on unmodified (UNM) or modified (m⁶A, Ψ, and m¹Ψ) “curlcake” sequences for two positions each (*top* and *bottom* row). Asterisks indicate modified sites. Positions at which the mismatch frequency exceeds 0.2 are colored, and deletions are visualized as drop in coverage. (D) Comparison of the summed errors (mismatch, deletion, and, insertion frequency) for all 5-mers with a central modified base for the three tested basecalling models (default, IVT, and SUP). (E) Box plots showing the delta summed error for 5-mers of selected modifications (m⁶A, Ψ, and m¹Ψ) at the central (modified) position. For equivalent plots of the remaining RNA modifications tested (m⁵C, hm⁵C, ac⁴C, and m⁵U), see Supplemental Figure S2B. Statistical analysis was performed using a two-sided nonparametric Wilcoxon test with “default” as the reference group. Results were corrected for multiple-hypothesis testing using the Benjamini–Hochberg procedure to obtain adjusted P-values: (ns) P > 0.05, (*) P ≤ 0.05, (**) P ≤ 0.01, (***) P ≤ 0.001, (****) P ≤ 0.0001. The sample size reported by n represents the number of 5-mers contributing to each box plot per panel. For D and E, the box is limited by the lower-quartile Q1 (*bottom*) and upper-quartile Q3 (*top*). Whiskers are defined as 1.5 × IQR with outliers represented as individual dots. (F) Comparison of the median delta summed error (=SumErr_MOD − SumErr_UNM) for centrally modified (position 0) 5-mers. A Z-score normalization per modification (see Methods) was performed to highlight the effect that different basecalling models have on the obtained delta summed error. Unscaled results are reported in Supplemental Figure S2B.

**Figure 2.**
The effect of custom basecalling models on biologically relevant m⁶A motifs and at different m⁶A stoichiometries. (A) Comparison of the delta summed errors (ΔSumErr = SumErr_MOD − SumErr_UNM) at the central modified position for 5-mers that fall within the DRACH motif (D = A, G, or U; R = G or A; H = A, C, or U) and others (non-DRACH). Statistical analysis was performed using a two-sided nonparametric Wilcoxon test with “default” as the reference group. (B) Comparison of the ΔSumErr for individual motifs that are DRACH and non-DRACH. Each dot represents a single observation of that 5-mer within the “curlcake” sequences. To compare the relationship between individual 5-mers, a paired two-sided Wilcoxon test was performed. (C) Schematic representation of the experimental approach to obtain “curlcakes” at different m⁶A stoichiometries. ATP and m⁶ATP were mixed at different ratios and used for the in vitro transcription of “curlcakes” with the remaining NTPs, followed by poly(A)-tailing, library preparation, and direct RNA sequencing (DRS). (D) Comparison of error-signature amplitude at different modification stoichiometries between 5-mers that fall within the DRACH motif. The corresponding plot for non-DRACH motifs is reported in Supplemental Figure S5B. Lines were fitted using locally estimated scatterplot smoothing (loess), and lightly shaded areas correspond to the 95% confidence interval. For A,B the box is limited by the lower-quartile Q1 (*bottom*) and upper-quartile Q3 (*top*). Whiskers are defined as 1.5 × IQR with outliers represented as individual dots. All statistical tests were corrected for multiple hypothesis testing using the Benjamini–Hochberg correction, and adjusted P-values are displayed: (ns) P > 0.05, (*) P ≤ 0.05, (**) P ≤ 0.01.

**Figure 3.**
Detection of m⁶A-modified sites from HEK293T cells using a pairwise comparison approach and validation of identified sites with orthogonal methods. (A) Schematic representation of the workflow to obtain m⁶A-modified sites using DRS. Samples from wild-type HEK293T and *METTL3*^−/− knockout cells were sequenced using DRS. Raw FAST5 files were basecalled using one of the three models tested (default, IVT, and SUP) and mapped to the human reference genome (hg38). To obtain m⁶A-modified sites, pairwise comparisons within each model were performed using ELIGOS2 (Jenjaroenpun et al. 2021). (B) Venn diagrams showing the overlap in absolute numbers between the nanopore used with either the default or IVT model and m6ACE-seq, miCLIP, and GLORI-seq (Pratanwanich et al. 2021; Liu et al. 2023). Bar plots show the relative amount of sites supported. (C) Metagene plot of sites identified by either the default or IVT model show the known pattern of m⁶A distribution with a strong enrichment around the stop codon. Corresponding motifs identified de novo for default (motif AvRec = 0.35) and IVT (motif AvRec = 0.401), using BaMM. (D) Scatterplot depicting the delta summed error (=SumErr_WT − SumErr_KO) of m⁶A sites predicted with ELIGOS2 using default and IVT models, and their corresponding absolute m⁶A-levels as reported by GLORI-seq. Individual points represent single m⁶A sites. Lines represent models fit using a generalized additive model (GAM).

**Figure 4.**
SUP is a super high accuracy model that leads to improved read accuracy and mapping. (A) Cross species comparison of read accuracies obtained from each model. Reads were aligned to the respective reference transcriptome (see Supplemental Table S5). To account for vastly different expression levels that would bias the analysis toward strongly expressed genes, we filtered for the longest reads of each transcript, keeping at most 10 reads per transcript. (B) Comparison of reads mapped to the human reference transcriptome (n = 3) across three bins of read lengths. Error bars indicate ± 1 SD. Statistical significance was determined using a two-sided t-test, corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure: (ns) P ≥ 0.05, (*) P < 0.05). (C) Effect of distinct basecalling models on the proportion of mapped reads obtained from standard poly(A)-selected human samples (median length ≅ 850 bp) and a synthetic RNA library consisting of three polyadenylated in vitro transcripts (median length ≅ 200 nt). Statistical significance was determined using a one-sided t-test, and results were corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure: (ns) P ≥ 0.05, (*) P < 0.05, (**) P < 0.01. Box plots depict the underlying size distribution of the sequenced libraries. The box is limited by the lower-quartile Q1 (*bottom*) and upper-quartile Q3 (*top*). Whiskers are defined as 1.5 × IQR with outliers being removed.

**Figure 5.**
SUP improves the identification of m¹Ψ-modified residues in synthetic mRNA vaccines. (A) IGV screenshot showing the synthetic mRNA vaccine containing either unmodified uridine (U) or N1-methylpseudouridine (m¹Ψ). The 20,000 longest, uniquely mapped reads were selected for this analysis. Positions at which the mismatch frequency exceeds 0.2 are colored, and those with a mismatch frequency below 0.2 are shown in gray. (B) Comparison of mismatch frequency across all m¹Ψ sites (n = 170) found in the synthetic eGPF vaccine. To test for statistical significance, the nonparametric Wilcoxon-test was used, and values were corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure: (ns) P > 0.05, (*) P ≤ 0.05, (**) P ≤ 0.01, (***) P ≤ 0.001, (****) P ≤ 0.0001. (C) Comparison of U > C mismatch frequency for all 170 m¹Ψ sites between the default and SUP. Points located in the upper half of the dotted line represent sites with higher U > C frequency in SUP (n = 151), and points below the dotted line represent sites with higher U > C frequency in the default. (n = 19). Points are colored by the absolute difference between the default and SUP. (D) Comparison of U > C mismatch frequency across the most lowly modified sites in the default (median = 3.45) compared with the SUP (median = 19.43). A paired two-sided Wilcoxon test was performed to test for statistical significance. For B,D, the box is limited by the lower-quartile Q1 (*bottom*) and upper-quartile Q3 (*top*). Whiskers are defined as 1.5 × IQR with outliers removed.

See this image and copyright information in PMC

References

1. Abebe JS, Price AM, Hayer KE, Mohr I, Weitzman MD, Wilson AC, Depledge DP. 2022a. DRUMMER-rapid detection of RNA modifications through comparative nanopore sequencing. Bioinformatics 38: 3113–3115. 10.1093/bioinformatics/btac274 - DOI - PMC - PubMed
1. Abebe JS, Verstraten R, Depledge DP. 2022b. Nanopore-based detection of viral RNA modifications. mBio 13: e0370221. 10.1128/mbio.03702-21 - DOI - PMC - PubMed
1. Acera Mateos P, Sethi AJ, Ravindran A, Srivastava A, Woodward K, Mahmud S, Kanchi M, Guarnacci M, Xu J, Yuen ZWS, et al. 2024. Prediction of m6A and m5C at single-molecule resolution reveals a transcriptome-wide co-occurrence of RNA modifications. Nat Commun 15: 3899. 10.1038/s41467-024-47953-7 - DOI - PMC - PubMed
1. Begik O, Lucas MC, Pryszcz LP, Ramirez JM, Medina R, Milenkovic I, Cruciani S, Liu H, Vieira HGS, Sas-Chen A, et al. 2021. Quantitative profiling of pseudouridylation dynamics in native RNAs with nanopore sequencing. Nat Biotechnol 39:1278–1291. 10.1038/s41587-021-00915-6 - DOI - PubMed
1. Begik O, Mattick JS, Novoa EM. 2022. Exploring the epitranscriptome by native RNA sequencing. RNA 28: 1430–1439. 10.1261/rna.079404.122 - DOI - PMC - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
- PubMed Central
Molecular Biology Databases
- NIAID Data Ecosystem - Find datasets on Infectious and Immune-mediated Diseases

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Enhanced detection of RNA modifications and read mapping with high-accuracy nanopore RNA basecalling models

Affiliations

Enhanced detection of RNA modifications and read mapping with high-accuracy nanopore RNA basecalling models

Authors

Affiliations

Abstract

Figures

References

MeSH terms

Substances

LinkOut - more resources

Full Text Sources

Molecular Biology Databases