Nucleic Acids Res. 2025 Apr 10;53(7):gkaf277.
doi: 10.1093/nar/gkaf277.

Analysis of RNA translation with a deep learning architecture provides new insight into translation control

Xiaojuan Fan et al. Nucleic Acids Res.

Abstract

Accurate annotation of coding regions in RNAs is essential for understanding gene translation. We developed a deep neural network to directly predict and analyze translation initiation and termination sites from RNA sequences. Trained with human transcripts, our model learned hidden rules of translation control and achieved a near-perfect prediction of canonical translation sites across the entire human transcriptome. Surprisingly, this model revealed a new role of codon usage in regulating translation termination, which was experimentally validated. We also identified thousands of new open reading frames in mRNAs or lncRNAs, some of which were confirmed experimentally. The model trained with human mRNAs achieved high prediction accuracy of canonical translation sites in all eukaryotes and good prediction in polycistronic transcripts from prokaryotes or RNA viruses, suggesting a high degree of conservation in translation control. Collectively, we present TranslationAI (https://www.biosino.org/TranslationAI/), a general and efficient deep learning model for RNA translation that generates new insights into the complexity of translation regulation.

Conflict of interest statement

None declared.

Figures

Graphical Abstract
Figure 1.
Construction of the deep learning network for translation prediction. (A). Flowchart of TranslationAI, a computational model for predicting TIS/TTS from full-length mRNA. For each position in the full-length mRNA, TranslationAI-2k takes 60, 200, 600, and 2000 nucleotides of flanking sequence as input, and predicts whether that position corresponds to a TIS, TTS, or neither. (B). Effect of input sequence context size on network accuracy. PR-AUC is the area under the precision-recall curve. The figure shows the performance of the network at four different input sequence context sizes. (C). Relationship between transcript length and the positive rate of TIS/TTS for each transcript, calculated by comparing the number of positive TISs from TranslationAI-200 or TranslationAI-2k to the total number of transcripts with lengths shorter than the given transcript. The distribution of transcript length is shown in the background as a histogram. (D). Cumulative distribution of TIS/TTS scores among all mRNAs. (E). The number of ORFs predicted by TranslationAI-2k using mRNA and shuffled mRNA (mononucleotide shuffling) as input at different TIS score cutoffs. The average number of ORFs predicted from multiple shuffles is shown, with error bars representing three times the standard deviation calculated from shuffled sequences. (F–I). The features learned by TranslationAI. Systematic in silico perturbations were applied to different regions of the mRNA, and their effects were measured as changes (Δ score) in probability scores for the authentic TISs/TTSs. The perturbations include: (F), replacement of TIS/TTS identity; (G), changes in UTR length and sequence; (H), frameshifts by deleting one, two, or three nucleotides; (I), mutations in CDSs. Blue box: ORF, black line: 5′-UTR and 3′-UTR, green line: TIS, pink line: TTS. Mononucleotide shuffling was performed for shuffling of 5′-UTR and 3′-UTR. Codon shuffling was performed for triplet shuffle, which maintained the amino acid composition. CAI: Codon Adaptation Index.
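The two shuffling controls named in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the example sequences, and the choice to keep the start and stop codons fixed during triplet shuffling are assumptions made here for illustration.

```python
import random


def mononucleotide_shuffle(seq, rng):
    """Shuffle single nucleotides; only the base composition is preserved."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def triplet_shuffle(cds, rng):
    """Shuffle whole codons of an in-frame CDS.

    Codon (and therefore amino acid) composition is preserved.
    Assumption for this sketch: the first and last codons (start and
    stop) stay in place; the caption does not state this.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if len(codons) <= 2:
        return cds
    inner = codons[1:-1]
    rng.shuffle(inner)
    return "".join([codons[0]] + inner + [codons[-1]])


# Hypothetical example sequences, not taken from the paper.
rng = random.Random(0)
utr = "GGCACUUAGCCAUU".replace("U", "T")
cds = "ATGGCTCTGAGCGCATTGTCCTAA"
shuffled_utr = mononucleotide_shuffle(utr, rng)
shuffled_cds = triplet_shuffle(cds, rng)
```

Mononucleotide shuffling destroys codon structure and any sequence motif, whereas triplet shuffling keeps every codon intact and only changes codon order, which is why the caption notes that it maintains amino acid composition.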
Figure 2.
Evaluation of predictive features in TranslationAI. (A). Conservation analysis of all TIS/TTS, strong TIS/TTS, and weak TIS/TTS regions. P-values were calculated using the Mann–Whitney U test. (B). Metagene analysis and (C). boxplot of read density on transcripts containing strong and weak TIS/TTS (−20 nt : +20 nt) in an iPSC cell line. P-values were calculated using the Mann–Whitney U test. (D). Motif analysis of strong and weak TIS/TTS. (E). The codon distribution at the −30 nt position of the stop codon. The ratios of codons for the same amino acid with C/G and A/T at the third position before strong and weak stop codons were quantified. P-values were calculated using a standard t-test. (F). Validation of motifs regulating translation termination. A known readthrough stop codon from VDR and its mutants (without amino acid changes) were tested for translation readthrough. No_stop: stop codon of VDR (TGA) was mutated to CGA; strong TTS1: stop codon of VDR was mutated to a strong TTS by changing the upstream 27 nt and downstream 12 nt sequences; strong TTS2: stop codon of VDR was mutated to a strong TTS by changing the upstream 27 nt sequence; weak TTS: stop codon of VDR was mutated to a weak TTS by changing the upstream 27 nt. Two sixfold-degenerate amino acids (leucine and serine) and one fourfold-degenerate amino acid (alanine) were validated by the same experiment. The readthrough efficiency was measured by the Nluc activity normalized to the Fluc transfection controls (n = 3, mean ± SD, P-value was calculated by Student's t-test). The heatmap shows codon frequencies for serine, leucine, and alanine, with annotations consistent with Supplementary Fig. S3E. (G). Boxplot analysis of read density on 3′-UTRs from transcripts with strong/weak TTS in an iPSC cell line and an iPSC-induced cardiomyocyte cell line, respectively. (H). Normalized read density (by abundance of the transcript) at the last P-site before the stop codon in in vitro platelet-like particles (PLP) and PLP treated with high salt.
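The third-position comparison in panel (E) can be sketched in Python as below. This is an illustrative reading of the caption, not the authors' pipeline: the function name and the example codons are hypothetical, and the sketch simply counts how many synonymous codons end in C/G versus A/T.

```python
def third_position_counts(codons):
    """Count codons whose third base is C/G versus A/T.

    Accepts DNA or RNA codons; U is treated as T.
    Returns a (gc_count, at_count) tuple.
    """
    gc = 0
    at = 0
    for codon in codons:
        third = codon.upper().replace("U", "T")[2]
        if third in "GC":
            gc += 1
        elif third in "AT":
            at += 1
    return gc, at


def gc3_fraction(codons):
    """Fraction of codons ending in C/G; None if no codons were counted."""
    gc, at = third_position_counts(codons)
    total = gc + at
    return gc / total if total else None


# Hypothetical alanine codons observed upstream of a set of stop codons.
alanine_codons = ["GCC", "GCA", "GCG", "GCT"]
counts = third_position_counts(alanine_codons)
fraction = gc3_fraction(alanine_codons)
```

Computing this fraction separately for codons upstream of strong and weak stop codons, per amino acid, would give the two groups that the caption compares with a t-test.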
Figure 3.
Identification of non-canonical ORFs in the human transcriptome. (A). Flowchart for identifying non-canonical ORFs, including upstream ORFs, downstream ORFs, dual coding ORFs, and new ORFs from ncRNAs. (B). Example ribosome footprints of a uORF from ERVMER-1 and a dORF from AAK1. (C). The ratio of various types of predicted TISs that were also identified in another published dataset derived from ribosome profiling assays. (D). The number of newly identified translatable ncRNAs in annotated lncRNAs, processed transcripts, and other transcripts without known ORFs. (E). The number of ORFs predicted by TranslationAI-2k using ncRNA and shuffled ncRNA sequences as input. The average numbers of ORFs predicted from lncRNAs and shuffled lncRNAs (mononucleotide shuffling, as control) are shown, with error bars representing 3 × standard deviation calculated from shuffled sequences. (F). Metagene analysis of translatable lncRNAs (green line) and control (grey line, mRNAs with the same ORF length distribution as the predicted ORFs from lncRNAs). (G). The MS/MS spectra of peptides from two ncRNAs: lncRNA ENST00000609975 (APRSSGPRM) and antisense RNA ENST00000413405 (FTDDTFDPELAATIGTNLR). The annotated b- and y-ions are marked in red and green, respectively.
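The shuffled-sequence control in panel (E) summarizes repeated shuffles as a mean with error bars of three standard deviations. A minimal Python sketch of that summary follows. The function name and the counts are hypothetical, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
import statistics


def shuffle_control_summary(observed, shuffled_counts, k=3.0):
    """Summarize ORF counts from repeated shuffles.

    observed: number of ORFs predicted from the real sequences.
    shuffled_counts: ORF counts from each independent shuffle.
    k: error-bar multiplier (3 in the caption).
    Assumption: sample standard deviation (statistics.stdev).
    """
    mean = statistics.mean(shuffled_counts)
    sd = statistics.stdev(shuffled_counts)
    return {
        "mean": mean,
        "error_bar": k * sd,
        "exceeds_control": observed > mean + k * sd,
    }


# Hypothetical counts, not values from the paper.
summary = shuffle_control_summary(observed=20, shuffled_counts=[10, 12, 14])
```

An observed count lying above the mean of the shuffled controls by more than the error bar indicates that the predictions depend on sequence order rather than on nucleotide composition alone.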
Figure 4.
TranslationAI accurately predicts TIS/TTS of eukaryotes, prokaryotes, and viruses. (A). The AUC, PR-AUC, and prediction accuracy of TIS prediction across tested eukaryotes (human, mouse, zebrafish, Drosophila, Arabidopsis, and budding yeast S. cerevisiae). The predictions of two previous models are also included for comparison. (B, C). Prediction of TISs/TTSs on Ebola genomic or RNA sequences (B) and SARS-CoV-2 genomic or subgenomic sequences (C). The upper scales represent the genomic sequence position.
