Genome Biol. 2024 Oct 8;25(1):258.
doi: 10.1186/s13059-024-03397-2.

DEMINING: A deep learning model embedded framework to distinguish RNA editing from DNA mutations in RNA sequencing data

Zhi-Can Fu et al. Genome Biol. 2024.

Abstract

Precise calling of promiscuous adenosine-to-inosine RNA editing sites from transcriptomic datasets is hindered by DNA mutations and sequencing/mapping errors. Here, we present a stepwise computational framework, called DEMINING, to distinguish RNA editing and DNA mutations directly from RNA sequencing datasets, with an embedded deep learning model named DeepDDR. After transfer learning, DEMINING can also classify RNA editing sites and DNA mutations from non-primate sequencing samples. When applied to samples from acute myeloid leukemia patients, DEMINING uncovers previously underappreciated DNA mutation and RNA editing sites, some of which are associated with the upregulated expression of host genes or the production of neoantigens.

Keywords: AML; DNA mutation; Deep learning; IDR; Neoantigens; RNA editing; RNA-seq; Transfer learning.

Conflict of interest statement

F.N. and L.Y. have filed a patent application (202310642373.8) relating to this work through Children’s Hospital of Fudan University. However, the patent does not restrict use for educational, research, or not-for-profit purposes. The remaining authors declare no competing interests.

Figures

Fig. 1
Developing the DeepDDR model embedded in DEMINING for classification of DNA mutations (DMs) and RNA editing sites (REs). a Construction of the stepwise DEMINING computational framework for direct DNA mutation (DM) and RNA editing (RE) classification. HPB hits per billion mapped bases, MF mutation frequency, MR mutation read. See the “Methods” section for details. b Schematic diagram of the embedded DeepDDR model for DM and RE classification. Left, feature extraction strategy based on the co-occurrence frequencies of each mutation site with its context bases (CMC). Right, DeepDDR model architecture. See the “Methods” section for details. c Evaluation of different models on RE identification. Receiver operating characteristic (ROC, left) and precision-recall (PRC, right) curves of DeepDDR (red), EditPredict (purple), and RED-ML (blue) show their performance on RE identification with the test set. Area under ROC (AUROC) and area under PRC (AUPRC) values for each model are included in the figure. d Evaluation of DeepDDR on DM identification. ROC (left) and PRC (right) curves of DeepDDR show its performance on DM identification with the test set. AUROC and AUPRC values of DeepDDR are included in the figure.
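The AUROC values reported in panels c and d can be read as the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `auroc`, the label encoding (1 = RE, 0 = DM), and the example scores are all assumptions made for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (positive, e.g. RE) or 0 (negative, e.g. DM).
    scores: iterable of classifier scores, higher meaning "more positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative site")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six sites (not data from the paper).
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auroc(example_labels, example_scores), 3))  # 0.889
```

The quadratic loop is chosen for readability; for large test sets a rank-based computation or a library routine would be the usual choice.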
Fig. 2
Model performance on a human independent test set. a Prediction of DMs and REs on the independent test set with the DEMINING/DeepDDR model. True DMs and REs were classified from paired WGS and RNA-seq data. DEMINING/DeepDDR was used to predict DMs and REs from training-independent RNA-seq data of sample HG00145. See the “Methods” section for details. b Evaluation metrics for RE identification on HG00145, including accuracy, precision, recall, specificity, and F1 score, comparing DeepDDR (red bar), EditPredict (purple bar), and RED-ML (blue bar). c Evaluation of different models on RE identification on HG00145. Receiver operating characteristic (ROC, left) and precision-recall (PRC, right) curves of DeepDDR (red), EditPredict (purple), and RED-ML (blue) show their performance on RE identification. Area under ROC (AUROC) and area under PRC (AUPRC) values for each model are included in the figure. d Evaluation metrics for DM identification on HG00145, including accuracy, precision, recall, specificity, and F1 score, for DeepDDR. e Evaluation of DeepDDR on DM identification on HG00145. ROC (left) and PRC (right) curves of DeepDDR show its performance on DM identification. AUROC and AUPRC values of DeepDDR are included in the figure.
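The five metrics named in panels b and d all derive from the same confusion-matrix counts. The sketch below uses the standard textbook definitions and is not taken from the paper; the function name `binary_metrics`, the string labels "RE" and "DM", and the ten example sites are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="RE"):
    """Accuracy, precision, recall, specificity, and F1 for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical calls for ten sites (not data from the paper).
truth = ["RE", "RE", "RE", "RE", "DM", "DM", "DM", "DM", "DM", "DM"]
calls = ["RE", "RE", "RE", "DM", "DM", "DM", "DM", "DM", "RE", "DM"]
print(binary_metrics(truth, calls))
```

Passing `positive="DM"` gives the corresponding metrics for DM identification from the same calls.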
Fig. 3
A model trained on human data accurately classifies DMs and REs in mouse data using transfer learning. a Prediction of DMs and REs in mouse datasets with the original DeepDDR model. DeepDDR was used to predict DMs and REs from WT RNA-seq data of mouse bone marrow, and true DMs and REs were classified by comparing WT and Adar KO RNA-seq data. See the “Methods” section for details. b Evaluation of different models on RE identification. ROC curves of DeepDDR (red) and EditPredict (purple) show their performance on RE identification with the mouse bone marrow dataset. AUROC values of DeepDDR (red) and EditPredict (purple) are included in the figure. Of note, because the features used by RED-ML were designed to be extracted only from the human genome, RED-ML could not be included in this and the following comparisons. c Evaluation metrics for RE identification (left) and DM identification (right), including accuracy, precision, recall, specificity, and F1 score, for DeepDDR (red) and EditPredict (purple). d Schematic of constructing the DeepDDR-transfer model. An additional mouse brain RNA-seq dataset containing WT and Adars (Adar and Adarb1) double knockout (DKO) samples was used for transfer learning. See the “Methods” section for details. e Evaluation of different models after transfer learning on RE identification. ROC curves of DeepDDR-transfer (red dashed line) and EditPredict-transfer (purple dashed line) show their performance on RE identification with the same mouse bone marrow dataset. AUROC values of DeepDDR-transfer (red) and EditPredict-transfer (purple) are included in the figure. f Evaluation metrics for RE identification (left) and DM identification (right), including accuracy, precision, recall, specificity, and F1 score, for DeepDDR-transfer (red shaded bar) and EditPredict-transfer (purple shaded bar).
Fig. 4
Applying the DEMINING framework to identify disease-related mutations in acute myeloid leukemia (AML). a Identification of AML-associated DMs from corresponding RNA-seq datasets by DEMINING. b Overlap of AML-specific DMs with reported SNVs in public databases, including ClinVar 2023.04 (https://www.ncbi.nlm.nih.gov/clinvar/), COSMIC (version 97, https://cancer.sanger.ac.uk/cosmic/), and dbSNP (version 156, https://www.ncbi.nlm.nih.gov/snp/). c Mutation frequency distribution of all AML-specific DMs (left), overlapped AML-specific DMs (middle), and non-overlapped AML-specific DMs (right). d Overlap of 4464 mutated genes carrying AML-specific recoding DMs with 50 AML-associated genes listed in the COSMIC Cancer Gene Census (CGC). e Gene Ontology (GO) enrichment analysis in biological process (BP) terms for three gene sets: all 4464 mutated genes, 86 AML-associated genes listed in CGC, and their 50 overlapping genes. Top GO terms ordered by adjusted P value in at least one gene set were kept and compared. f The number of overlapped (dark gray) and non-overlapped (light gray) DMs in the top 10 genes. Left, top 10 genes out of 4464 genes with recoding DMs; right, top 10 genes out of 50 AML-associated genes listed in COSMIC CGC.
Fig. 5
AML-specific DMs identified in three ANKRD genes are enriched in IDR coding regions and correlated with expression. a Distribution of identified AML-specific recoding DMs (top) and AML-specific DMs (bottom) along the coding sequences (CDS) of ANKRD36C (left), ANKRD36 (middle), and ANKRD36B (right). The distribution (black plot) of DMs predicted by DEMINING (black vertical lines) is shown along the genes’ CDS regions (gray and red rectangles). Red rectangles represent CDS regions that encode intrinsically disordered regions (IDRs) predicted by MobiDB-lite integrated in the InterPro database (https://www.ebi.ac.uk/interpro/). b Comparison of gene expression of ANKRD36C, ANKRD36, and ANKRD36B in 17 normal control samples (Ctrl) and 19 AML patients with or without recoding DMs (AML w DM: AML patients with recoding DM; AML w/o DM: AML patients without recoding DM). The boxplot summarizes results for all samples, with the number of samples n shown below. Center line: median. Box bottom and top edges: 25th and 75th percentiles. Whiskers extend to the most extreme points that are not outliers (outliers lie more than 1.5 times the interquartile range below the 25th or above the 75th percentile). Outliers omitted for clarity. Violin-shaped areas: kernel density estimate of the data distribution. Statistical significance was assessed with a two-tailed Wilcoxon rank-sum test. *P < 0.05, **P < 0.01, ***P < 0.001.
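The boxplot convention described in panel b (quartile box, whiskers limited by 1.5 times the interquartile range) can be sketched as follows. This is an illustrative reading of the caption, not the authors' plotting code: the function name `box_summary` and the example values are hypothetical, and the quartile interpolation used here (Python's `statistics.quantiles` with `method="inclusive"`) is an assumption, since plotting packages differ in how they compute percentiles.

```python
import statistics


def box_summary(values):
    """Median, quartiles, whisker ends, and outliers under the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Points inside the fences define the whiskers; the rest are outliers.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical expression values (not data from the paper).
summary = box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary["whisker_low"], summary["whisker_high"], summary["outliers"])  # 1 9 [50]
```

Here the value 50 falls above the upper fence, so the upper whisker stops at 9 and 50 would be one of the points the caption describes as omitted for clarity.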
