A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets

Carmen Lai¹, Marcel J T Reinders, Laura J van't Veer, Lodewyk F A Wessels

PMID: 16670007
PMCID: PMC1569875
DOI: 10.1186/1471-2105-7-235

Comparative Study

Carmen Lai et al. BMC Bioinformatics. 2006 May 2:7:235. doi: 10.1186/1471-2105-7-235.

Affiliation

¹ Information and Communication Theory Group, Delft University of Technology, Delft, The Netherlands. c.lai@ewi.tudelft.nl

Abstract

Background: Gene selection is an important step when building predictors of disease state based on gene expression data. Gene selection generally improves performance and identifies a relevant subset of genes. Many univariate and multivariate gene selection approaches have been proposed. Frequently the claim is made that genes are co-regulated (due to pathway dependencies) and that multivariate approaches are therefore by definition more desirable than univariate selection approaches. However, the published performances of these approaches do not allow a fair comparison, for two main reasons. First, the validation set is often involved, in one way or another, in training the predictor, which yields optimistically biased performance estimates. Second, the published results are often based on a small number of relatively simple datasets. Consequently, no generally applicable conclusions can be drawn.

Results: In this study we adopted an unbiased protocol to perform a fair comparison of frequently used multivariate and univariate gene selection techniques, in combination with a range of classifiers. Our conclusions are based on seven gene expression datasets, across several cancer types.

Conclusion: Our experiments illustrate that, contrary to several previous studies, in five of the seven datasets univariate selection approaches yield consistently better results than multivariate approaches. The simplest multivariate selection approach, the Top Scoring method, achieves the best results on the remaining two datasets. We conclude that the correlation structures, if present, are difficult to extract due to the small number of samples, and that consequently, overly complex gene selection algorithms that attempt to extract these structures are prone to overtraining.

Figures

**Figure 1**
The training-validation protocol employed to evaluate various gene selection and classification approaches, in simplified schematic format. The input is a labeled dataset, $D$, and the output is an estimate of the validation performance of algorithm $A$, denoted by $P_A$. The most important steps in the protocol are the training step (block labeled 'Train') and the validation step (block labeled 'Validate'). The training step, in turn, consists of two steps, namely 1) the optimization of the gene selection parameter, $\phi$, employing an $N_i$-fold cross-validation loop and 2) training the final classifier given the optimal setting of the selection parameter. The validation step estimates the performance of the optimal trained classifier ($\omega_A^*$) on the completely independent validation set.
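
The protocol in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the signal-to-noise gene score, the nearest-centroid classifier, the candidate grid for the selection parameter, the use of several outer folds to obtain the independent validation sets, and every function name (`score_genes`, `train`, `evaluate`, `make_data`) are hypothetical choices made here. The point it shows is that gene selection and parameter tuning only ever see the training part of each split.

```python
import random
from statistics import mean, pstdev

def score_genes(X, y):
    """Univariate signal-to-noise score per gene (criterion assumed for illustration)."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9))
    return scores

def train(X, y, phi):
    """Select the top-phi genes using (X, y) only, then fit nearest centroids."""
    scores = score_genes(X, y)
    genes = sorted(range(len(scores)), key=lambda g: -scores[g])[:phi]
    centroids = {}
    for c in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == c]
        centroids[c] = [mean(r[g] for r in rows) for g in genes]
    return genes, centroids

def error_rate(model, X, y):
    """Fraction of samples assigned to the wrong class centroid."""
    genes, centroids = model
    def predict(row):
        return min(centroids, key=lambda c: sum(
            (row[g] - m) ** 2 for g, m in zip(genes, centroids[c])))
    return sum(predict(r) != lab for r, lab in zip(X, y)) / len(y)

def folds(y, k, rng):
    """Split sample indices into k class-balanced folds."""
    out = [[] for _ in range(k)]
    for c in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == c]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            out[j % k].append(i)
    return out

def split(X, y, held_out):
    """Return (X_rest, y_rest, X_held, y_held) for a list of held-out indices."""
    held = set(held_out)
    rest = [i for i in range(len(y)) if i not in held]
    return ([X[i] for i in rest], [y[i] for i in rest],
            [X[i] for i in held_out], [y[i] for i in held_out])

def evaluate(X, y, grid=(1, 2, 5, 10), n_outer=5, n_inner=3, seed=0):
    """Estimate validation error; the validation fold is never used for training."""
    rng = random.Random(seed)
    outer_errors = []
    for val_idx in folds(y, n_outer, rng):
        X_tr, y_tr, X_val, y_val = split(X, y, val_idx)
        inner = folds(y_tr, n_inner, rng)

        def cv_error(phi):
            # Step 1: inner cross-validation on the training part only.
            errs = []
            for test_idx in inner:
                X_fit, y_fit, X_te, y_te = split(X_tr, y_tr, test_idx)
                errs.append(error_rate(train(X_fit, y_fit, phi), X_te, y_te))
            return mean(errs)

        best_phi = min(grid, key=cv_error)
        # Step 2: retrain on the full training part with the chosen phi.
        final = train(X_tr, y_tr, best_phi)
        # Validation step: score the final classifier on the untouched fold.
        outer_errors.append(error_rate(final, X_val, y_val))
    return mean(outer_errors)

def make_data(n_per_class=20, n_genes=30, shift=4.0, seed=1):
    """Hypothetical toy data: only genes 0 and 1 differ between the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for lab in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[0] += shift * lab
            row[1] += shift * lab
            X.append(row)
            y.append(lab)
    return X, y

if __name__ == "__main__":
    X, y = make_data()
    print("estimated validation error:", evaluate(X, y))
```

Calling `score_genes` on the full dataset before splitting would be the kind of leakage the Background describes, because the validation samples would then influence which genes are selected.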
