CPPred: coding potential prediction based on the global description of RNA sequence

Xiaoxue Tong¹, Shiyong Liu¹

PMID: 30753596
PMCID: PMC6486542
DOI: 10.1093/nar/gkz087

Nucleic Acids Res. 2019 May 7;47(8):e43.

Affiliation

¹ School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China.

Abstract

Rapid and accurate methods for distinguishing coding RNAs from ncRNAs are critical for analyzing the thousands of novel transcripts generated in recent years by next-generation sequencing technology. The previously developed methods CPAT, CPC2 and PLEK distinguish coding RNAs from ncRNAs very well, but perform poorly at distinguishing small coding RNAs from small ncRNAs. Herein, we report an approach, CPPred (coding potential prediction), which is based on an SVM classifier and multiple sequence features, including novel RNA features encoded by the global description. Owing to these novel RNA features, CPPred distinguishes not only coding RNAs from ncRNAs, but also small coding RNAs from small ncRNAs, better than the state-of-the-art methods. A recent study proposed 1335 novel human coding RNAs from a large number of RNA-seq datasets. However, CPPred predicts only 119 of these transcripts as coding RNAs. In fact, almost all of the proposed novel coding RNAs (91.1%) are classified as ncRNAs, which is consistent with previous reports. Remarkably, we also reveal that the features encoded by the global description (T2, C0 and GC) play an important role in the prediction of coding potential.
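The 91.1% figure follows directly from the two counts quoted in the abstract. A minimal arithmetic check, using only those two numbers (the variable names are illustrative):

```python
# Counts quoted in the abstract.
proposed = 1335          # novel human coding RNAs proposed by the earlier study
predicted_coding = 119   # of those, predicted as coding by CPPred

# Everything not predicted as coding is classified as ncRNA.
predicted_noncoding = proposed - predicted_coding
fraction_noncoding = predicted_noncoding / proposed

print(f"{predicted_noncoding} of {proposed} ({fraction_noncoding:.1%}) classified as ncRNA")
# prints: 1216 of 1335 (91.1%) classified as ncRNA
```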

Figures

**Figure 1.**
(A) Three-dimensional plot of Hexamer score, Fickett score and ORF length for 33 360 coding RNAs and 24 163 ncRNAs (Human-Training). (B) Three-dimensional plot of Hexamer score, Fickett score and ORF length for 508 small coding RNAs and 508 small ncRNAs extracted from Human-Training; the small coding RNAs have ORFs shorter than 303 nucleotides.

**Figure 2.**
Flowchart for building the human training and testing sets. Human coding RNAs with transcript status ‘KNOWN’ are downloaded from NCBI RefSeq, and human ncRNAs are downloaded from Ensembl. The initial dataset includes 50 040 coding RNAs and 37 297 ncRNAs. ncRNAs that have no source comments and are not annotated with Havana in the corresponding gff3 file are removed, leaving 50 040 coding RNAs and 36 244 ncRNAs. Two-thirds of the data are randomly selected as the training set, a collection of 33 360 coding RNAs and 24 163 ncRNAs called Human-Training; the rest of the data are kept as a testing set. Redundancy between the testing and training sets is then reduced using CD-hit with a sequence identity cutoff of ≥80%, leaving 8557 coding RNAs and 8241 ncRNAs as Human-Testing. Next, coding RNAs with ORFs shorter than 303 nucleotides are extracted from Human-Testing, and the same number of ncRNAs of comparable length are randomly selected from Human-Testing. As a result, 641 coding RNAs and 641 ncRNAs are kept as Human-sORF-Testing.
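The training-set sizes in this caption are consistent with a two-thirds split applied to each class separately. The sketch below assumes exactly that (per-class shuffling, rounding to the nearest integer); the caption does not state how the split was implemented, and `split_two_thirds` and the seed are hypothetical. The CD-hit redundancy step is not reproduced here.

```python
import random


def split_two_thirds(items, seed=0):
    """Randomly split items into (two-thirds, remainder).

    Assumption: the split size is rounded to the nearest integer.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * 2 / 3)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in IDs for the filtered dataset sizes given in the caption.
coding_train, coding_rest = split_two_thirds(range(50040))
nc_train, nc_rest = split_two_thirds(range(36244))

print(len(coding_train), len(nc_train))  # prints: 33360 24163
```

The remainders (16 680 coding RNAs and 12 081 ncRNAs under this assumption) would then be the input to redundancy reduction against the training set.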

**Figure 3.**
Pipeline of CPPred. Multiple features are extracted from RNA or protein sequences. The CTD features comprise nucleotide composition, nucleotide transition and nucleotide distribution. The ORF coverage is defined as the ratio of ORF length to transcript length. The ORF length, Hexamer score and Fickett score are discussed in the ‘CPPred features’ section. The ORF integrity indicates whether the ORF starts with a start codon (AUG) and ends with a stop codon (UGA, UAA or UAG). PI, Gravy and Instability are calculated with ProtParam. The best feature subset is then selected using mRMR-IFS and used as input to the SVM classifier. The resulting final model is tested and evaluated on the testing sets.
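A few of the features named in this caption can be illustrated with a short sketch. The caption does not say how the ORF is chosen or how the CTD descriptors are encoded, so the code assumes the longest AUG-initiated ORF on the forward strand and the generic composition/transition convention (fraction of each nucleotide; fraction of adjacent positions that switch between two given nucleotides). All function names are hypothetical, and this is not CPPred's implementation.

```python
from itertools import combinations

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, has_stop) for the longest ATG-started ORF (assumed rule)."""
    best = (0, 0, False)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3, True)
                start = None
        if start is not None:  # ORF runs off the transcript without a stop codon
            end = start + (len(seq) - start) // 3 * 3
            if end - start > best[1] - best[0]:
                best = (start, end, False)
    return best


def orf_features(rna):
    """ORF length, coverage (ORF length / transcript length) and integrity."""
    seq = rna.upper().replace("U", "T")
    start, end, has_stop = longest_orf(seq)
    length = end - start
    coverage = length / len(seq) if seq else 0.0
    # Every ORF found above starts with ATG, so integrity reduces to has_stop.
    return {"orf_length": length, "orf_coverage": coverage,
            "orf_integrity": int(length > 0 and has_stop)}


def ctd_composition_transition(rna):
    """Nucleotide composition and transition under the generic CTD convention."""
    seq = rna.upper().replace("U", "T")
    if len(seq) < 2:
        raise ValueError("need at least two nucleotides")
    composition = {b: seq.count(b) / len(seq) for b in "ACGT"}
    pairs = list(zip(seq, seq[1:]))
    transition = {}
    for x, y in combinations("ACGT", 2):
        switches = sum(1 for p in pairs if p in ((x, y), (y, x)))
        transition[x + y] = switches / (len(seq) - 1)
    return composition, transition


print(orf_features("GGAUGAAAUAGCC"))
print(ctd_composition_transition("AACG"))
```

For the toy transcript `GGAUGAAAUAGCC`, the ORF `AUG AAA UAG` spans 9 of 13 nucleotides and is complete, so the integrity flag is 1; dropping the stop codon would set it to 0.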

**Figure 4.**
mRMR-IFS feature selection. The mRMR-IFS scatter plots of the feature subsets, drawn with R, correspond to the two training sets, Human-Training and Integrated-Training. The x-coordinate is the number of features in the feature subset, and the y-coordinate is the MCC of the corresponding 10-fold cross-validation.
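The curve described in this caption can be sketched as an incremental feature selection (IFS) loop: starting from a ranked feature list, each prefix of the list is scored and the best-scoring prefix is kept. The sketch below assumes the ranking (from mRMR) and the scoring function (MCC from 10-fold cross-validation of the SVM) are supplied by the caller; neither is implemented here, and the function names and toy scores are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def incremental_feature_selection(ranked_features, score_subset):
    """Score each prefix of a ranked feature list and return the best one.

    score_subset is a caller-supplied callable, e.g. the cross-validated MCC
    of a classifier trained on that subset.
    """
    best_subset, best_score = [], float("-inf")
    curve = []  # (number of features, score) pairs, i.e. the plotted points
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = score_subset(subset)
        curve.append((k, score))
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score, curve


# Toy illustration with made-up scores, not results from the paper.
toy_scores = {1: 0.50, 2: 0.80, 3: 0.70}
subset, score, curve = incremental_feature_selection(
    ["f1", "f2", "f3"], lambda s: toy_scores[len(s)]
)
print(subset, score)  # prints: ['f1', 'f2'] 0.8
```

Each `(k, score)` pair in `curve` corresponds to one point in the scatter plot, with `k` on the x-axis and the MCC on the y-axis.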
