Gene selection in cancer classification using sparse logistic regression with Bayesian regularization
- PMID: 16844704
- DOI: 10.1093/bioinformatics/btl386
Abstract
Motivation: Gene selection algorithms for cancer classification, based on the expression of a small number of biomarker genes, have been the subject of considerable research in recent years. Shevade and Keerthi propose a gene selection algorithm based on sparse logistic regression (SLogReg) incorporating a Laplace prior to promote sparsity in the model parameters, and provide a simple but efficient training procedure. The degree of sparsity obtained is determined by the value of a regularization parameter, which must be carefully tuned in order to optimize performance. This normally involves a model selection stage, based on a computationally intensive search for the minimizer of the cross-validation error. In this paper, we demonstrate that a simple Bayesian approach can be taken to eliminate this regularization parameter entirely, by integrating it out analytically using an uninformative Jeffreys prior. The improved algorithm (BLogReg) is then typically two or three orders of magnitude faster than the original algorithm, as there is no longer a need for a model selection step. The BLogReg algorithm is also free from selection bias in performance estimation, a common pitfall in the application of machine learning algorithms in cancer classification.
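The step of integrating out the regularization parameter can be sketched as a short derivation. The notation here is an assumption and does not appear in the abstract: \alpha_1, ..., \alpha_N denote the model parameters subject to the Laplace prior, \lambda the regularization parameter, and E_\alpha the L1 penalty term. Under those assumptions, marginalizing \lambda under a Jeffreys prior replaces the tunable penalty \lambda E_\alpha with a parameter-free term N \log E_\alpha:

```latex
% Assumed notation (not from the abstract): N parameters, L1 penalty E_\alpha.
\begin{align}
  E_\alpha &= \sum_{i=1}^{N} |\alpha_i|, \qquad
  p(\alpha \mid \lambda) = \left(\frac{\lambda}{2}\right)^{N} \exp(-\lambda E_\alpha), \qquad
  p(\lambda) \propto \frac{1}{\lambda}, \\
  p(\alpha) &= \int_{0}^{\infty} p(\alpha \mid \lambda)\, p(\lambda)\, d\lambda
    \;\propto\; \frac{1}{2^{N}} \int_{0}^{\infty} \lambda^{N-1} e^{-\lambda E_\alpha}\, d\lambda
    = \frac{\Gamma(N)}{(2 E_\alpha)^{N}}, \\
  -\log p(\alpha) &= N \log E_\alpha + \text{const}.
\end{align}
```

Because no free parameter remains in the resulting penalty, there is nothing left to tune by cross-validation, which is consistent with the abstract's statement that the model selection step is no longer needed.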
Results: The SLogReg, BLogReg and Relevance Vector Machine (RVM) gene selection algorithms are evaluated over the well-studied colon cancer and leukaemia benchmark datasets. The leave-one-out estimates of the probability of test error and cross-entropy of the BLogReg and SLogReg algorithms are very similar; however, the BLogReg algorithm is found to be considerably faster than the original SLogReg algorithm. Using nested cross-validation to avoid selection bias, performance estimation for SLogReg on the leukaemia dataset takes almost 48 h, whereas the corresponding result for BLogReg is obtained in only 1 min 24 s, making BLogReg by far the more practical algorithm. BLogReg also demonstrates better estimates of conditional probability than the RVM, which are of great importance in medical applications, with similar computational expense.
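The two leave-one-out quantities reported above, the error rate and the cross-entropy, can be computed from held-out predictions as in the following minimal Python sketch. It is illustrative only and is not the authors' MATLAB code: the function name `loo_metrics`, the 0.5 decision threshold, and the clipping constant `eps` are assumptions introduced here, and each probability is assumed to come from a model trained without the corresponding sample.

```python
import math


def loo_metrics(probs, labels, eps=1e-12):
    """Return (error_rate, mean_cross_entropy) from leave-one-out predictions.

    probs[i]  : predicted P(y = 1) for sample i, from a model that was
                trained with sample i held out (assumed, not checked here).
    labels[i] : true class of sample i, encoded as 0 or 1.
    eps       : hypothetical clipping constant that keeps log() finite.
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")

    errors = 0
    total_xent = 0.0
    for p, y in zip(probs, labels):
        # Clip so that a confident wrong prediction gives a large finite loss.
        p = min(max(p, eps), 1.0 - eps)
        # Classify as positive when the predicted probability is >= 0.5.
        predicted_positive = p >= 0.5
        errors += int(predicted_positive != (y == 1))
        # Negative log-likelihood of the true label.
        total_xent -= math.log(p) if y == 1 else math.log(1.0 - p)

    n = len(probs)
    return errors / n, total_xent / n


if __name__ == "__main__":
    # Made-up probabilities for four held-out samples, for illustration only.
    err, xent = loo_metrics([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
    print(f"error rate = {err:.2f}, cross-entropy = {xent:.4f}")
```

The cross-entropy penalizes poorly calibrated probabilities even when the thresholded decision is correct, which is why it is a useful complement to the error rate when conditional probability estimates matter.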
Availability: A MATLAB implementation of the sparse logistic regression algorithm with Bayesian regularization (BLogReg) is available from http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/