PLoS One. 2011;6(12):e28210.
doi: 10.1371/journal.pone.0028210. Epub 2011 Dec 21.

The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

Anne-Claire Haury et al. PLoS One. 2011.

Abstract

Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. In this study we compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Surprisingly, complex wrapper and embedded methods generally do not outperform simple univariate feature selection methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to provide the best results.
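As a rough illustration of the univariate approach the abstract singles out, the sketch below ranks features by the absolute value of a two-sample Student's t statistic and keeps the top k as the signature. It is a minimal sketch, not the authors' implementation: the function names `t_statistic` and `select_signature`, the 0/1 label encoding, the pooled-variance form of the test, and the toy data are all assumptions made for this example.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample Student's t statistic with pooled variance (assumed form)."""
    na, nb = len(a), len(b)
    diff = mean(a) - mean(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        # Degenerate case: no within-group spread at all.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))


def select_signature(X, y, k):
    """Return the indices of the k features with the largest |t|.

    X is a list of samples (each a list of feature values) and y holds
    hypothetical binary labels (1 = one outcome group, 0 = the other).
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(t_statistic(pos, neg)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (invented): feature 0 separates the groups, features 1 and 2 barely do.
X = [
    [5.1, 2.0, 0.3],
    [4.9, 2.1, 0.9],
    [5.0, 1.9, 0.5],
    [1.0, 2.0, 0.6],
    [1.2, 2.2, 0.4],
    [0.9, 1.8, 0.8],
]
y = [1, 1, 1, 0, 0, 0]

signature = select_signature(X, y, k=1)
```

In a study like this one, the selected indices would then feed a classifier, and the selection would be repeated inside each cross-validation fold so that test samples never influence the ranking.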

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Area under the ROC curve. Signature of size [formula] in a [formula]-fold CV setting, averaged over the four datasets.
Figure 2. Area under the ROC curve. NC classifier trained as a function of the size of the signature, for different feature selection methods, in a [formula]-fold CV setting, averaged over the four datasets.
Figure 3. Area under the ROC curve. NC classifier trained as a function of the number of samples in a [formula]-fold CV setting. The accuracy shown is for 100-gene signatures, averaged over the four datasets. Note that the maximum value of the x axis is constrained by the smallest dataset, namely GSE2990.
Figure 4. Area under the ROC curve. NC classifier trained as a function of the number of samples in a [formula]-fold CV setting for each of the four datasets. The accuracy shown is for 100-gene signatures.
Figure 5. Stability for a signature of size 100. Averages and standard errors are obtained over the four datasets. a) Soft-perturbation setting. b) Hard-perturbation setting. c) Between-datasets setting.
Figure 6. Evolution of stability of t-test signatures with respect to the size of the training set in the hard-perturbation and the between-datasets settings from GSE2034 and GSE4922.
Figure 7. Stability of different methods in the between-datasets setting, as a function of the size of the signature.
Figure 8. GO interpretability for a signature of size 100. Average number of GO BP terms significantly over-represented.
Figure 9. GO stability for a signature of size 100 in the soft-perturbation setting. Averages and standard errors are obtained over the four datasets. a) Soft-perturbation setting. b) Hard-perturbation setting. c) Between-datasets setting.
Figure 10. Bias in the selection through entropy and Bhattacharyya distance. Estimated cumulative distribution functions (ECDF) of the first ten genes selected by four methods on GSE1456. They are compared to the ECDF of [formula] randomly chosen background genes.
Figure 11. Estimated distribution of the first gene selected by entropy and Bhattacharyya distance.
Figure 12. Accuracy/stability trade-off. Accuracy versus stability for each method in the between-datasets setting. The results shown are averages over the four datasets.
