PLoS One. 2011;6(12):e28210.

doi: 10.1371/journal.pone.0028210. Epub 2011 Dec 21.

The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

Anne-Claire Haury¹, Pierre Gestraud, Jean-Philippe Vert

PMID: 22205940
PMCID: PMC3244389
DOI: 10.1371/journal.pone.0028210

Anne-Claire Haury et al. PLoS One. 2011.

Affiliation

¹ Mines ParisTech, Centre for Computational Biology, Fontainebleau, France. anne-claire.haury@mines-paristech.fr

Abstract

Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. In this study, we compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Surprisingly, complex wrapper and embedded methods generally do not outperform simple univariate feature selection methods, and ensemble feature selection generally has no positive effect. Overall, a simple Student's t-test seems to provide the best results.
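To make the univariate approach mentioned in the abstract concrete, the following is a minimal sketch of ranking genes by a two-sample Student's t statistic and keeping the top-ranked ones as a signature. It is an illustration only, not the authors' code: the function names (`t_statistic`, `select_signature`), the dictionary-based data layout and the toy expression values are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample Student's t statistic using the pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0  # no within-group variation: treat the gene as uninformative
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def select_signature(expression, labels, size):
    """Rank genes by |t| between the two classes and keep the top `size`.

    expression: dict mapping a gene name to a list of values, one per sample
    labels: list of 0/1 class labels, one per sample (hypothetical encoding)
    """
    scores = {}
    for gene, values in expression.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(t_statistic(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)[:size]


# Toy data invented for illustration; not taken from the study's datasets.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": [5.1, 4.9, 5.3, 1.0, 1.2, 0.9],  # strongly differential
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # essentially flat
    "gene_c": [3.0, 3.4, 2.8, 2.1, 2.5, 1.9],  # moderately differential
}
signature = select_signature(expression, labels, size=2)
print(signature)
```

In a cross-validation setting, a selection step like this would be rerun on each training fold so that the held-out samples never influence which genes are chosen.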

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Area under the ROC curve.**
Signature of fixed size in a k-fold CV setting, averaged over the four datasets.

**Figure 2. Area under the ROC curve.**
NC classifier trained as a function of the size of the signature, for different feature selection methods, in a k-fold CV setting averaged over the four datasets.

**Figure 3. Area under the ROC curve.**
NC classifier trained as a function of the number of samples in a k-fold CV setting. We show here the accuracy for 100-gene signatures, averaged over the datasets. Note that the maximum value of the x axis is constrained by the smallest dataset, namely GSE2990.

**Figure 4. Area under the ROC curve.**
NC classifier trained as a function of the number of samples in a k-fold CV setting for each of the four datasets. We show here the accuracy for 100-gene signatures.

**Figure 5. Stability for a signature of size 100.**
Average and standard errors are obtained over the four datasets. a) Soft-perturbation setting. b) Hard-perturbation setting. c) Between-datasets setting.

**Figure 6. Evolution of stability of t-test signatures with respect to the size of the training set in the hard-perturbation and the between-datasets settings from GSE2034 and GSE4922.**

**Figure 7. Stability of different methods in the between-datasets setting, as a function of the size of the signature.**

**Figure 8. GO interpretability for a signature of size 100.**
Average number of GO BP terms significantly over-represented.

**Figure 9. GO stability for a signature of size 100 in the soft-perturbation setting.**
Average and standard errors are obtained over the four datasets. A) Soft-perturbation setting. B) Hard-perturbation setting. C) Between-datasets setting.

**Figure 10. Bias in the selection through entropy and Bhattacharyya distance.**
Estimated cumulative distribution functions (ECDF) of the first ten genes selected by four methods on GSE1456. They are compared to the ECDF of randomly chosen background genes.

**Figure 11. Estimated distribution of the first gene selected by entropy and Bhattacharyya distance.**

**Figure 12. Accuracy/stability trade-off.**
Accuracy versus stability for each method in the between-datasets setting. We show here the average results over the four datasets.
