Review
2021 Dec 13;7(4):77.
doi: 10.3390/ncrna7040077.

Common Features in lncRNA Annotation and Classification: A Survey

Christopher Klapproth et al. Noncoding RNA.

Abstract

Long non-coding RNAs (lncRNAs) are widely recognized as important regulators of gene expression. Their molecular functions range from miRNA sponging to chromatin-associated mechanisms, leading to effects in disease progression and establishing them as diagnostic and therapeutic targets. Still, only a few representatives of this diverse class of RNAs are well studied, while the vast majority is poorly described beyond the existence of their transcripts. In this review we survey common in silico approaches for lncRNA annotation. We focus on the well-established sets of features used for classification and discuss their specific advantages and weaknesses. While the available tools perform very well for the task of distinguishing coding sequence from other RNAs, we find that current methods are not well suited to distinguish lncRNAs or parts thereof from other non-protein-coding input sequences. We conclude that the distinction of lncRNAs from intronic sequences and untranslated regions of coding mRNAs remains a pressing research gap.

Keywords: classification problems; coding sequence; feature extraction; lncRNA; machine learning.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
A comparison of the frequencies of commonly used features (blue) and algorithms (orange) applied by different contemporary tools. A majority of these tools rely on open reading frame (ORF) information to make predictions. Other frequently used features include subsequence (k-mer) frequencies and GC content. SVMs and Random Forests dominate the field as the most commonly implemented algorithms. This is not surprising, as they are by far two of the most flexible approaches for nonlinear classification.
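The three feature families named in this caption (ORF information, k-mer frequencies, GC content) can be illustrated with a minimal Python sketch. It is not taken from the review or from any of the surveyed tools. The function names, the forward-strand-only ORF scan, and the ATG-to-stop ORF definition are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def _normalize(seq: str) -> str:
    """Upper-case the sequence and map RNA 'U' to DNA 'T'."""
    return seq.upper().replace("U", "T")


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    seq = _normalize(seq)
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq: str, k: int = 2) -> dict:
    """Relative frequency of every k-mer over the alphabet ACGT.

    k-mers containing other symbols (for example N) are ignored.
    """
    seq = _normalize(seq)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ORF (ATG through stop codon).

    Assumption: only the three forward-strand frames are scanned, and an
    ORF must end with an in-frame stop codon.
    """
    seq = _normalize(seq)
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def sequence_features(seq: str, k: int = 2) -> dict:
    """Combine the three feature families into one flat feature dict."""
    features = {
        "gc_content": gc_content(seq),
        "longest_orf": longest_orf_length(seq),
    }
    features.update(kmer_frequencies(seq, k))
    return features
```

A feature dictionary of this kind could then be passed to a generic classifier such as an SVM or a Random Forest. The published tools differ in how they define and weight each feature.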
Figure 2
Sankey diagrams showing how five high-impact classification tools assign the sequences of an input dataset to the coding or non-coding class. The dataset consists of 200 randomly chosen lncRNAs, 200 coding transcripts, dinucleotide-shuffled versions of the latter, and 194 randomly chosen sequences from the human genome (hg38) and corresponding annotation (RefSeq database [40] v38). All tools were run with standard settings where applicable.
