Nucleic Acids Res. 2019 May 7;47(8):e43.
doi: 10.1093/nar/gkz087.

CPPred: coding potential prediction based on the global description of RNA sequence

Xiaoxue Tong et al. Nucleic Acids Res.

Abstract

Rapid and accurate discrimination between coding RNAs and ncRNAs is critical for analyzing the thousands of novel transcripts generated in recent years by next-generation sequencing technology. The previously developed methods CPAT, CPC2 and PLEK distinguish coding RNAs from ncRNAs very well, but perform poorly at distinguishing small coding RNAs from small ncRNAs. Herein, we report an approach, CPPred (coding potential prediction), which is based on an SVM classifier and multiple sequence features, including novel RNA features encoded by the global description. Owing to these novel RNA features, CPPred distinguishes not only coding RNAs from ncRNAs but also small coding RNAs from small ncRNAs better than the state-of-the-art methods. A recent study proposed 1335 novel human coding RNAs from a large number of RNA-seq datasets. However, CPPred predicts only 119 of these transcripts as coding RNAs. In fact, almost all of the proposed novel coding RNAs (91.1%) are predicted to be ncRNAs, which is consistent with previous reports. Remarkably, we also reveal that the global description encoding features (T2, C0 and GC) play an important role in the prediction of coding potential.
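The 91.1% figure follows directly from the two counts quoted in the abstract. The short check below only reproduces that arithmetic; the variable names are illustrative and not taken from the paper.

```python
# Counts quoted in the abstract.
proposed = 1335          # novel human coding RNAs proposed by the earlier study
predicted_coding = 119   # of those, predicted as coding RNAs by CPPred

# Everything not predicted as coding is predicted as ncRNA.
predicted_noncoding = proposed - predicted_coding
share_noncoding = 100 * predicted_noncoding / proposed

print(predicted_noncoding)        # 1216
print(f"{share_noncoding:.1f}%")  # 91.1%
```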

Figures

Figure 1.
(A) Three-dimensional plot of Hexamer score, Fickett score and ORF length for 33 360 coding RNAs and 24 163 ncRNAs (Human-Training). (B) Three-dimensional plot of Hexamer score, Fickett score and ORF length for 508 small coding RNAs and 508 small ncRNAs, where the small coding RNAs (ORF <303 nucleotides in length) and the small ncRNAs were extracted from Human-Training.
Figure 2.
Flowchart for building the human training and testing sets. Human coding RNAs with transcript status ‘KNOWN’ are downloaded from NCBI RefSeq, and human ncRNAs are downloaded from Ensembl. The initial dataset includes 50 040 coding RNAs and 37 297 ncRNAs. ncRNAs that have no source comments and are not annotated with Havana in the corresponding gff3 file are removed, leaving 50 040 coding RNAs and 36 244 ncRNAs. Two-thirds of the data are randomly selected as the training set, a collection of 33 360 coding RNAs and 24 163 ncRNAs called Human-Training; the rest of the data are kept as a testing set. Redundancy between the testing and training sets is then reduced using CD-hit with a sequence identity cutoff ≥80%. Finally, 8557 coding RNAs and 8241 ncRNAs are kept as Human-Testing. Next, coding RNAs with an ORF shorter than 303 nucleotides are extracted from Human-Testing, and the same number of ncRNAs of comparable length are randomly selected from Human-Testing. As a result, 641 coding RNAs and 641 ncRNAs are kept as Human-sORF-Testing.
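The two-thirds split described in this caption can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `two_thirds_split`, the fixed seed and the rounding rule are assumptions, and integer ranges stand in for the real transcript records. The CD-hit redundancy step is not reproduced.

```python
import random

def two_thirds_split(items, seed=0):
    """Shuffle a copy of items and split it into (training, remainder)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Assumed rounding rule: nearest integer to two-thirds of the total.
    n_train = round(len(shuffled) * 2 / 3)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder IDs with the post-filtering counts given in the caption.
coding_ids = range(50040)
ncrna_ids = range(36244)

coding_train, coding_rest = two_thirds_split(coding_ids)
ncrna_train, ncrna_rest = two_thirds_split(ncrna_ids)

print(len(coding_train), len(ncrna_train))  # 33360 24163
print(len(coding_rest), len(ncrna_rest))    # 16680 12081
```

The training sizes match the Human-Training counts in the caption. The remainders (16 680 and 12 081) are the sizes before redundancy removal; according to the caption, CD-hit filtering then leaves 8557 coding RNAs and 8241 ncRNAs in Human-Testing.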
Figure 3.
Pipeline of CPPred. Multiple features are extracted from RNA or protein sequences. The CTD features comprise nucleotide composition, nucleotide transition and nucleotide distribution. ORF coverage is defined as the ratio of the ORF length to the transcript length. ORF length, Hexamer score and Fickett score are discussed in the ‘CPPred features’ section. ORF integrity indicates whether the ORF starts with a start codon (AUG) and ends with a stop codon (UGA, UAA or UAG). pI, Gravy and Instability are calculated by ProtParam. The best feature subset is then selected using mRMR-IFS and used as input to the SVM classifier. The resulting final model is tested and evaluated on the testing sets.
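A few of the features named in this caption can be illustrated with a short sketch. It is not the CPPred implementation. The ORF search (longest AUG-initiated ORF in the three forward frames) is an assumption, since the caption does not say how the ORF is located. The composition and transition functions follow one common reading of CTD descriptors (per-nucleotide frequency, and frequency of adjacent unlike pairs), which is also an assumption; the distribution descriptor is omitted. All function names are hypothetical.

```python
from itertools import combinations

START = "AUG"
STOPS = {"UGA", "UAA", "UAG"}

def longest_orf(seq):
    """Return (start, end, has_stop) for the longest forward-strand ORF."""
    rna = seq.upper().replace("T", "U")
    best = (0, 0, False)
    for frame in range(3):
        start = None
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3, True)
                start = None
        if start is not None:  # ORF runs off the end without a stop codon
            end = start + 3 * ((len(rna) - start) // 3)
            if end - start > best[1] - best[0]:
                best = (start, end, False)
    return best

def orf_features(seq):
    """ORF length, coverage (ORF length / transcript length) and integrity."""
    start, end, has_stop = longest_orf(seq)
    length = end - start
    return {
        "orf_length": length,
        "orf_coverage": length / len(seq) if seq else 0.0,
        # The ORF always starts with AUG here, so integrity depends on the stop.
        "orf_integrity": int(length > 0 and has_stop),
    }

def composition_and_transition(seq):
    """Assumed CTD reading: nucleotide frequencies and unlike-neighbour rates."""
    rna = seq.upper().replace("T", "U")
    n = max(len(rna), 1)
    comp = {nt: rna.count(nt) / n for nt in "ACGU"}
    pairs = {a + b: 0 for a, b in combinations("ACGU", 2)}
    for x, y in zip(rna, rna[1:]):
        key = "".join(sorted(x + y))
        if key in pairs:  # skips identical neighbours and non-ACGU symbols
            pairs[key] += 1
    steps = max(len(rna) - 1, 1)
    trans = {k: v / steps for k, v in pairs.items()}
    return comp, trans

features = orf_features("CCAUGAAACCCUAGGG")
print(features)  # {'orf_length': 12, 'orf_coverage': 0.75, 'orf_integrity': 1}
```

In the example transcript the ORF `AUGAAACCCUAG` spans 12 of 16 nucleotides, so the coverage is 0.75, and it ends with the stop codon UAG, so the integrity flag is 1.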
Figure 4.
mRMR-IFS feature selection. The mRMR-IFS scatter plots of the feature subsets, drawn with R, correspond to the two training sets, Human-Training and Integrated-Training, respectively. The x-coordinate is the number of features in the feature subset, and the y-coordinate is the MCC of the corresponding 10-fold cross-validation.
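The curve in this figure can be thought of as follows: features are taken in ranked order, each growing prefix is scored, and the prefix with the highest MCC is kept. The sketch below illustrates that idea under stated assumptions. The ranking is assumed to be given (for example by mRMR), `score_subset` is a hypothetical callback standing in for the 10-fold cross-validated SVM, and the toy scores are made-up placeholders, not results from the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def incremental_feature_selection(ranked_features, score_subset):
    """Score each top-k prefix of a ranked feature list and keep the best one.

    Returns (best_subset, curve), where curve holds (k, score) pairs that
    correspond to the x and y coordinates of an IFS plot.
    """
    best_k, best_score, curve = 0, float("-inf"), []
    for k in range(1, len(ranked_features) + 1):
        score = score_subset(ranked_features[:k])
        curve.append((k, score))
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], curve

# Toy example: placeholder feature names and placeholder scores by subset size.
ranked = ["feature_a", "feature_b", "feature_c", "feature_d"]
toy_scores = {1: 0.60, 2: 0.75, 3: 0.74, 4: 0.70}
best_subset, curve = incremental_feature_selection(
    ranked, lambda subset: toy_scores[len(subset)]
)
print(best_subset)  # ['feature_a', 'feature_b']
```

In a real run, `score_subset` would train and cross-validate a classifier on the given features and return its MCC, computed as in `mcc` above.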
