[Preprint]. 2024 Jul 2:2023.07.08.548206.
doi: 10.1101/2023.07.08.548206.

Analysis of RNA translation with a deep learning architecture provides new insight into translation control

Xiaojuan Fan et al. bioRxiv.

Abstract

Accurate annotation of coding regions in RNAs is essential for understanding gene translation. We developed a deep neural network to directly predict and analyze translation initiation and termination sites from RNA sequences. Trained on human transcripts, our model learned hidden rules of translation control and achieved near-perfect prediction of canonical translation sites across the entire human transcriptome. Surprisingly, this model revealed a new role of codon usage in regulating translation termination, which was experimentally validated. We also identified thousands of new open reading frames in mRNAs or lncRNAs, some of which were confirmed experimentally. The model trained on human mRNAs achieved high prediction accuracy for canonical translation sites in all eukaryotes tested and good prediction in polycistronic transcripts from prokaryotes or RNA viruses, suggesting a high degree of conservation in translation control. Collectively, we present a general and efficient deep learning model for RNA translation that generates new insights into the complexity of translation regulation.

Conflict of interest statement

Declaration of Interests: The authors declare no competing financial interests.

Figures

Figure 1. Construction of deep learning network for translation prediction.
A. Flowchart of TranslationAI, a computational model for predicting translation initiation and termination sites from full-length mRNA. For each position in the full-length mRNA, TranslationAI-2k takes 60, 200, 600, and 2,000 nucleotides of flanking sequence as input, and predicts whether that position corresponds to a translation initiation site (TIS), translation termination site (TTS), or neither. B. Effect of input sequence context size on network accuracy. PR-AUC is the area under the precision-recall curve. The figure shows the performance of the network at four different input sequence context sizes. C. Relationship between transcript length and the positive rate of TIS/TTS for each transcript, obtained by comparing the number of positive TISs from TranslationAI-200 (purple line) or TranslationAI-2k (red line) to the total number of transcripts with lengths shorter than the given transcript. The distribution of transcript length is shown in the background as a histogram. D. Cumulative distribution of TIS/TTS scores across all mRNAs. E. The number of ORFs predicted by TranslationAI-2k using mRNA and shuffled mRNA (mononucleotide shuffling) as input at different TIS score cutoffs. The average number of ORFs predicted across multiple shufflings is shown, with error bars representing three times the standard deviation calculated from the shuffled sequences. F-I. The features learned by TranslationAI. Systematic in silico perturbations were applied to different regions of the mRNA, and their effects were measured as changes (Δ score) in the probability scores of the authentic TISs/TTSs. The perturbations include: F, replacement of TIS/TTS identity; G, changes in UTR length and sequence; H, frameshifts by deleting one, two, or three nucleotides; I, mutations in coding sequences. Blue box: ORF (open reading frame), black line: 5′-UTR and 3′-UTR, green line: TIS, pink line: TTS. Mononucleotide shuffling was used to shuffle the 5′-UTR and 3′-UTR. Codon shuffling was used for the triplet shuffle, which maintains the amino acid composition. CAI: Codon Adaptation Index.
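The per-position input described in panel A can be illustrated with a minimal sketch. It assumes that the flanking size is counted per side, that positions beyond the transcript ends are zero-padded, and that nucleotides are one-hot encoded; none of these details is stated in the caption, and the function name `context_window` is hypothetical rather than part of TranslationAI.

```python
from typing import List

# One-hot codes for RNA nucleotides (assumed encoding, not taken from the paper).
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "U": [0, 0, 0, 1],
}
PAD = [0, 0, 0, 0]  # used beyond transcript ends and for unknown bases


def context_window(seq: str, pos: int, flank: int) -> List[List[int]]:
    """Encode position `pos` plus `flank` nucleotides on each side.

    Hypothetical helper: returns 2 * flank + 1 one-hot rows, zero-padded
    where the window runs past either end of the transcript.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    window = []
    for i in range(pos - flank, pos + flank + 1):
        if 0 <= i < len(seq):
            window.append(ONE_HOT.get(seq[i], PAD))
        else:
            window.append(PAD)
    return window


if __name__ == "__main__":
    # Window of 2 nt on each side of the A in "GCCAUGG" (index 3).
    for row in context_window("GCCAUGG", 3, 2):
        print(row)
```

A classifier would then map each such window to one of three labels (TIS, TTS, or neither); that network itself is outside the scope of this sketch.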
Figure 2. Evaluation of predictive features in TranslationAI.
A. Conservation analysis of all TIS/TTS, strong TIS/TTS, and weak TIS/TTS regions. P-values were calculated using the Mann–Whitney U test. B. Metagene analysis and C. boxplot of read density on transcripts containing strong (red) and weak (blue) TIS/TTS (−20nt : +20nt) in an iPSC line. P-values were calculated using the Mann–Whitney U test. D. Motif analysis of strong and weak TIS/TTS. E. The codon distribution at the −30nt position relative to the stop codon. The ratios of codons for the same amino acid with C/G versus A/T at the third position, before strong and weak stop codons, were quantified. P-values were calculated using a standard t-test. F. Validation of motifs around strong TTSs. The known readthrough stop codon of VDR and its mutants (without amino acid changes) were tested for translation readthrough. No_stop: stop codon of VDR (TGA) was mutated to CGA; strong TTS1: stop codon of VDR was mutated to a strong TTS by changing the upstream 27nt and downstream 12nt sequences; strong TTS2: stop codon of VDR was mutated to a strong TTS by changing the upstream 27nt sequence; weak TTS: stop codon of VDR was mutated to a weak TTS by changing the upstream 27nt sequence. See supplementary information in Table S3. G. Boxplot analysis of read density on 3′-UTRs of transcripts with strong/weak TTS in an iPSC line and an iPSC-induced cardiomyocyte line, respectively. H. Normalized read density (by transcript abundance) at the last P-site before the stop codon in in vitro platelet-like particles (PLP) and PLP treated with high salt.
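The third-position comparison in panel E can be approximated with a simplified sketch. It computes the fraction of codons ending in C or G among the last few sense codons before the stop codon. This pools all amino acids together, whereas the caption describes a per-amino-acid ratio, so it should be read as an illustration of the idea rather than the authors' procedure. The function name `gc3_fraction_upstream` and the default of 10 codons (30 nt) are assumptions.

```python
def gc3_fraction_upstream(cds: str, n_codons: int = 10) -> float:
    """Fraction of the last `n_codons` sense codons that end in C or G.

    Hypothetical helper. `cds` must start at the start codon and end with
    the stop codon, so its length is a multiple of three.
    """
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    if n_codons <= 0:
        raise ValueError("n_codons must be positive")
    # Split into codons and drop the final (stop) codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    tail = codons[-n_codons:]
    if not tail:
        raise ValueError("CDS has no sense codons before the stop")
    return sum(codon[2] in "CG" for codon in tail) / len(tail)


if __name__ == "__main__":
    # Sense codons ATG GCC AAG TTT end in G, C, G, T.
    print(gc3_fraction_upstream("ATGGCCAAGTTTTGA"))
```

Comparing the distribution of this value between transcripts with strong and weak predicted stop codons would be one way to look for the codon-usage difference the caption describes.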
Figure 3. Identification of non-canonical ORFs in human transcriptome.
A. Flowchart for identifying non-canonical ORFs, including upstream ORFs, downstream ORFs, dual coding ORFs, and new ORFs from non-coding RNAs. B. Example ribosome footprints of a uORF from ERVMER-1 and a dORF from AAK1. C. The ratio of various types of predicted TISs that were also identified in a published dataset derived from ribosome profiling assays. D. The number of newly identified translatable ncRNAs among annotated lncRNAs, processed transcripts, and other transcripts without known ORFs. E. The number of ORFs predicted by TranslationAI-2k using non-coding RNA and shuffled non-coding RNA sequences as input. The average numbers of ORFs predicted from lncRNAs and shuffled lncRNAs (mononucleotide shuffling, as control) are shown, with error bars representing 3×standard deviation calculated from shuffled sequences. F. Metagene analysis of translatable lncRNAs (green line) and control (grey line, mRNAs with the same ORF length distribution as the predicted ORFs from lncRNAs). G. The MS/MS spectra of peptides from two ncRNAs: lncRNA ENST00000609975 (APRSSGPRM) and antisense RNA ENST00000413405 (FTDDTFDPELAATIGTNLR). The annotated b- and y-ions are marked in red and green, respectively.
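The shuffled control in panel E can be sketched as follows. The sketch shuffles each sequence at the mononucleotide level several times, counts predicted ORFs with a caller-supplied function, and reports the mean and three times the standard deviation. The `count_orfs` argument is a hypothetical stand-in for a thresholded model prediction, and the number of shuffles and the seed are arbitrary assumptions.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def mononucleotide_shuffle(seq: str, rng: random.Random) -> str:
    """Permute the nucleotides of `seq`, preserving base composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def shuffled_background(
    seqs: Sequence[str],
    count_orfs: Callable[[str], int],  # hypothetical stand-in for the model
    n_shuffles: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean ORF count, 3 x standard deviation) over shuffles."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_shuffles):
        shuffled = [mononucleotide_shuffle(s, rng) for s in seqs]
        totals.append(sum(count_orfs(s) for s in shuffled))
    mean = float(statistics.mean(totals))
    sd = statistics.stdev(totals) if len(totals) > 1 else 0.0
    return mean, 3 * sd


if __name__ == "__main__":
    # Toy counter: number of AUG triplets, used only to exercise the code.
    toy_counter = lambda s: s.count("AUG")
    print(shuffled_background(["AUGGCCAUGUAA", "CCAUGGGAUGA"], toy_counter))
```

A count from the real sequences that lies well above the shuffled mean plus this error bar suggests that the predictions depend on sequence order and not just on base composition.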
Figure 4. TranslationAI accurately predicts TIS/TTS of eukaryotes, prokaryotes, and viruses.
A. The AUC, PR-AUC, and prediction accuracy of TIS prediction across tested eukaryotes (Human, Mouse, Zebrafish, Drosophila, Arabidopsis, and budding yeast S. cerevisiae). The predictions of two previous models are also included for comparison. B-C. The prediction of TISs/TTSs on Ebola genomic or RNA sequences (B) and SARS-CoV-2 genomic or subgenomic sequences (C). The upper scales represent the genomic sequence position, the blue boxes indicate the annotated ORFs, and the white boxes indicate the newly predicted out-of-frame ORFs. The green triangles and red lines indicate the predicted in-frame TISs and TTSs, respectively.
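Predicted sites can be assembled into candidate ORFs on a polycistronic sequence, as in panels B and C, along the lines of the minimal sketch below. It assumes that each predicted position is the 0-based index of the first nucleotide of a start or stop codon, and it pairs each TIS with the nearest downstream TTS in the same reading frame. The pairing rule and the function name `pair_orfs` are illustrative assumptions, not the authors' pipeline.

```python
from typing import Iterable, List, Tuple


def pair_orfs(
    tis_positions: Iterable[int],
    tts_positions: Iterable[int],
) -> List[Tuple[int, int]]:
    """Pair each TIS with the nearest downstream in-frame TTS.

    Hypothetical helper. Positions are 0-based indices of the first
    nucleotide of the start or stop codon; a TIS with no in-frame
    downstream TTS is skipped.
    """
    stops = sorted(tts_positions)
    orfs = []
    for start in sorted(tis_positions):
        for stop in stops:
            # In-frame means the distance is a whole number of codons.
            if stop > start and (stop - start) % 3 == 0:
                orfs.append((start, stop))
                break
    return orfs


if __name__ == "__main__":
    # Two starts in different frames, each matched to its own stop.
    print(pair_orfs([0, 4], [9, 10, 12]))
```

The frame of each resulting pair relative to an annotated ORF (start position modulo 3) then separates in-frame from out-of-frame predictions.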
