NAR Genom Bioinform. 2020 Jan 13;2(1):lqz024.
doi: 10.1093/nargab/lqz024. eCollection 2020 Mar.

RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences

Antonio P Camargo et al. NAR Genom Bioinform.

Abstract

The advent of high-throughput sequencing technologies made it possible to obtain large volumes of genetic information quickly and inexpensively. Thus, many efforts are devoted to unveiling the biological roles of genomic elements, with the distinction between protein-coding and long non-coding RNAs being one of the most important tasks. We describe RNAsamba, a tool that predicts the coding potential of RNA molecules from sequence information using a neural network-based approach that models both the whole sequence and the ORF to identify patterns that distinguish coding from non-coding transcripts. We evaluated RNAsamba's classification performance using transcripts from humans and several other model organisms and show that it consistently outperforms other state-of-the-art methods. Our results also show that RNAsamba can identify coding signals in partial-length ORFs and UTR sequences, indicating that its algorithm does not depend on complete transcript sequences. Furthermore, RNAsamba can also predict small ORFs, which are traditionally identified with ribosome profiling experiments. We believe that RNAsamba will enable faster and more accurate biological findings from genomic data of species that are being sequenced for the first time. A user-friendly web interface, documentation with instructions for local installation and usage, and the source code of RNAsamba can be found at https://rnasamba.lge.ibi.unicamp.br/.
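
To make the ORF-derived inputs concrete, the following is a minimal Python sketch of the kind of features the Figure 1 caption lists for the Longest ORF Branch: nucleotide k-mer frequencies (F1), amino acid frequencies (A1) and ORF length (O1). It is an illustration under stated assumptions, not RNAsamba's code. The ORF definition (ATG to the first in-frame stop on the forward strand, with ORFs lacking a stop allowed to run to the last full codon), the choice of k = 3, the length unit (nucleotides) and all function names are assumptions that do not come from the source.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Longest forward-strand ORF, stop codon excluded (assumed definition)."""
    seq = seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                if i - start > len(best):
                    best = seq[start:i]
                start = None
        # Assumption: an ORF without a stop runs to the last complete codon.
        if start is not None:
            end = start + ((len(seq) - start) // 3) * 3
            if end - start > len(best):
                best = seq[start:end]
    return best


def translate(orf):
    """Translate an in-frame nucleotide sequence; unknown codons become X."""
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X")
                   for i in range(0, len(orf) - 2, 3))


def kmer_frequencies(seq, k=3):
    """Relative frequency of every A/C/G/T k-mer (overlapping windows)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def amino_acid_frequencies(protein):
    """Relative frequency of each residue present in the protein."""
    counts = Counter(protein)
    return {aa: n / len(protein) for aa, n in counts.items()}


def orf_features(transcript, k=3):
    """Bundle the three hand-crafted ORF features into one dictionary."""
    orf = longest_orf(transcript)
    return {
        "orf_length": len(orf),  # in nucleotides (assumed unit)
        "kmer_freqs": kmer_frequencies(orf, k),
        "aa_freqs": amino_acid_frequencies(translate(orf)),
    }


features = orf_features("GGAUGAAAUUUUAGCC")
print(features["orf_length"], features["aa_freqs"])
```

In this toy transcript the only start codon opens a three-codon ORF, so the feature dictionary describes the peptide MKF. A real pipeline would also need a policy for ambiguous bases and for transcripts with no start codon, which this sketch handles only by returning an empty ORF.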

Figures

Figure 1.
(A) In an IGLOO layer, the input sequence is first processed by a 1D convolutional layer and down-sampled by max pooling. K patches, each consisting of four random slices, are then drawn from the resulting matrix and multiplied by a matrix of K learnable weights, producing a high-level representation of the input sequence. (B) RNAsamba derives two branches from the RNA sequence. In the Whole Sequence Branch (B1), the whole transcript nucleotide sequence is fed to two IGLOO layers to create high-level representations of the transcript (N1 and N2). In the Longest ORF Branch (B2), four layers are derived from the extracted ORF sequence: an IGLOO representation of the putative protein (P1), nucleotide k-mer frequencies (F1), amino acid frequencies (A1) and the ORF length (O1). The two branches are weighted by the α parameter and then used to compute the final classification of the transcript.
Figure 2.
Classification benchmark of six different coding potential calculators. (A) Classifier performance in four independent test datasets containing human transcripts. CPC2 is outside the displayed range in the mRNN-Challenge test dataset (75.35%). (B) Classifier performance in five different species. Values correspond to the area under the precision-recall curve. Pre-trained models provided by the authors of each tool were used.
Figure 3.
Evaluation of the ability of different tools to detect the coding potential of mouse ORFs with varying degrees of fragmentation.
Figure 4.
Classification benchmark of six different coding potential calculators in short ORF (sORF) datasets from five different species. Values correspond to the area under the precision-recall curve.
Figure 5.
Computational performance of RNAsamba, lncRNAnet and mRNN in the FEELnc dataset. (A) Peak memory usage during inference and training. (B) Average inference and training wall time of five independent executions of each algorithm. LncRNAnet does not provide an interface to train new models, so its training times were not measured. CPU computations were performed with two Intel® Xeon® E5-2420 v2 CPUs and GPU computations were performed with an NVIDIA® Tesla® K80. Inference execution time was measured with the hyperfine tool.
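
Figures 2 and 4 report the area under the precision-recall curve. As a reminder of what that metric measures, here is a minimal Python sketch that computes average precision, one common step-wise estimate of that area. The source does not state which estimator or software the authors used, so the function name, the toy labels and scores, and the choice of estimator are all assumptions made for illustration.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 1 for coding (positive), 0 for non-coding (negative).
    scores: predicted coding probability for each transcript.
    Tied scores are not grouped, which is a simplification.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    # Rank transcripts from most to least confidently coding.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Precision at the rank where recall increases.
            total += true_positives / rank
    return total / n_pos


# Hypothetical toy example: three coding and two non-coding transcripts.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(round(average_precision(labels, scores), 4))
```

Unlike accuracy, this metric depends only on how the classifier ranks transcripts, which makes it informative when coding and non-coding classes are imbalanced.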
