PLoS One. 2017 Aug 23;12(8):e0182507.
doi: 10.1371/journal.pone.0182507. eCollection 2017.

A comprehensive simulation study on classification of RNA-Seq data

Gökmen Zararsız et al. PLoS One. 2017.

Abstract

RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging and powerful method for diagnosis, disease classification and monitoring at the molecular level, as well as for providing potential markers of diseases. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be applied directly to RNA-Seq data, because RNA-Seq count data violate both the data-structure and the distributional assumptions. However, it is possible to apply these algorithms to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers, including PLDA with and without power transformation, NBLDA, single support vector machines (SVM), bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters, such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method, on model performance. A comprehensive simulation study was conducted, and its results were compared with the results from two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and the differential-expression rate, and decreasing the dispersion parameter and the number of groups, increases classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion.
We conclude that power-transformed PLDA, as a count-based classifier, and vst- or rlog-transformed RF and SVM, as microarray-based classifiers, may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
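
To make the count-based route described in the abstract more concrete, the following Python sketch shows a bare-bones Poisson linear discriminant rule. It is an illustration under stated assumptions, not the MLSeq implementation: it omits the power transformation and any shrinkage of the class effects, it uses simple total-count size factors, and the function names (fit_plda, predict_plda), the smoothing constant beta, and the toy counts are all hypothetical.

```python
import math


def fit_plda(counts, labels, beta=1.0):
    """Estimate class-by-gene effects d_kj from a samples x genes count matrix.

    beta is an assumed smoothing constant that keeps every d_kj positive.
    """
    n_genes = len(counts[0])
    grand_total = sum(sum(row) for row in counts)
    # Size factor s_i: each sample's share of all training counts.
    size = [sum(row) / grand_total for row in counts]
    # Gene total g_j: counts of gene j summed over all training samples.
    gene_total = [sum(row[j] for row in counts) for j in range(n_genes)]
    effects, priors = {}, {}
    for k in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == k]
        class_size = sum(size[i] for i in members)
        # d_kj = (observed class count + beta) / (expected class count + beta)
        effects[k] = [
            (sum(counts[i][j] for i in members) + beta)
            / (class_size * gene_total[j] + beta)
            for j in range(n_genes)
        ]
        priors[k] = len(members) / len(labels)
    return {
        "grand_total": grand_total,
        "gene_total": gene_total,
        "effects": effects,
        "priors": priors,
    }


def predict_plda(model, sample):
    """Return the class with the largest Poisson log-likelihood score."""
    s_new = sum(sample) / model["grand_total"]  # size factor of the new sample
    scores = {}
    for k, d_k in model["effects"].items():
        score = math.log(model["priors"][k])
        for x, g, d in zip(sample, model["gene_total"], d_k):
            # Poisson log-likelihood terms that depend on the class.
            score += x * math.log(d) - s_new * g * d
        scores[k] = score
    return max(scores, key=scores.get)


# Toy data (made up): gene 0 is high in class "A", gene 1 is high in class "B".
train = [[50, 5, 20], [60, 4, 22], [5, 55, 21], [6, 48, 19]]
classes = ["A", "A", "B", "B"]
toy_model = fit_plda(train, classes)
print(predict_plda(toy_model, [40, 3, 15]))
print(predict_plda(toy_model, [4, 60, 25]))
```

The score is linear in the observed counts, which is why the rule is called a linear discriminant even though it is derived from a Poisson model rather than a normal one.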

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. RNA-Seq classification workflow.
Fig 2. Genewise dispersion estimations for real datasets.
Fig 3. Simulation results for k = 2, dkj = 10%, transformation: rlog.
The figure shows classifier performance for varying sample size (n), number of genes (p) and dispersion level (φ = 0.01: very slight, φ = 0.1: substantial, φ = 1: very high).
Fig 4. Simulation results for k = 3, dkj = 10%, transformation: rlog.
The figure shows classifier performance for varying sample size (n), number of genes (p) and dispersion level (φ = 0.01: very slight, φ = 0.1: substantial, φ = 1: very high).
Fig 5. Results obtained from real datasets.
The figure shows classifier performance on each dataset as the number of most significant genes varies. PLDA and NBLDA are not applied to transformed data; their results are shown in the same figure as those for the transformed data only for comparison.
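
The three dispersion levels named in the captions for Figs 3 and 4 can be given a feel with a small simulation. The Python sketch below draws negative binomial counts as a gamma-Poisson mixture, under the common parameterization in which the variance is μ + φμ². That parameterization, the mean of 20, and the helper names (rpoisson, rnegbin) are assumptions made for illustration and are not taken from the paper's simulation code.

```python
import random
import statistics


def rpoisson(lam, rng):
    """Draw a Poisson(lam) count as the number of unit-rate arrivals before lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def rnegbin(mu, phi, rng):
    """Draw a negative binomial count with mean mu and dispersion phi.

    A gamma rate with shape 1/phi and scale mu*phi has mean mu and variance
    phi*mu**2; a Poisson draw on top of it gives variance mu + phi*mu**2.
    """
    rate = rng.gammavariate(1.0 / phi, mu * phi)
    return rpoisson(rate, rng)


rng = random.Random(2017)
mu = 20.0  # assumed mean count for the illustration
for phi in (0.01, 0.1, 1.0):
    draws = [rnegbin(mu, phi, rng) for _ in range(5000)]
    print(
        f"phi={phi}: sample mean={statistics.fmean(draws):.1f}, "
        f"sample variance={statistics.variance(draws):.1f}, "
        f"theoretical variance={mu + phi * mu ** 2:.1f}"
    )
```

At φ = 0.01 the variance stays close to the Poisson value (equal to the mean), whereas at φ = 1 it is dominated by the φμ² term, which is the overdispersion that the abstract says requires careful handling.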
