Biol Direct. 2012 Oct 2:7:33.
doi: 10.1186/1745-6150-7-33.

Stable feature selection and classification algorithms for multiclass microarray data

Sebastian Student et al. Biol Direct. 2012.

Abstract

Background: Recent studies suggest that gene expression profiles are a promising alternative for clinical cancer classification. One major problem in applying DNA microarrays to classification is the high dimensionality of the resulting data sets. In this paper we propose a multiclass gene selection method based on Partial Least Squares (PLS). The new idea is to solve the multiclass selection problem by combining the PLS method with a decomposition into a set of two-class sub-problems: one versus rest (OvR) and one versus one (OvO). We also apply the OvR and OvO two-class decompositions to other recently published gene selection methods. Ranked gene lists are highly unstable, in the sense that a small change in the data set often leads to large changes in the resulting ordered lists. We therefore assess the stability of the proposed methods. As classifiers we use linear support vector machines (SVM) in several variants (one versus one, one versus rest, and multiclass SVM (MSVM)) as well as linear discriminant analysis (LDA). We use the balanced bootstrap to estimate the prediction error and to test the variability of the obtained ordered lists.
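
The decomposition idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact PLS scoring rule or how scores from the sub-problems are combined, so the sketch assumes that each gene is scored by its squared weight in the first PLS component of a two-class sub-problem and that the scores are merged by taking the maximum over sub-problems. The function names (pls_weights, two_class_problems, rank_genes) are hypothetical.

```python
import numpy as np
from itertools import combinations


def pls_weights(X, y):
    """First PLS component weight vector for a two-class problem.

    X is a samples-by-genes matrix; y holds the class coding (+1 / -1).
    The weight vector is the normalised covariance between genes and y.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w


def two_class_problems(labels, scheme="ovr"):
    """Yield (sample_mask, y) pairs for the OvR or OvO decomposition."""
    classes = np.unique(labels)
    if scheme == "ovr":
        # One sub-problem per class: that class against all others.
        for c in classes:
            mask = np.ones(len(labels), dtype=bool)
            yield mask, np.where(labels == c, 1.0, -1.0)
    elif scheme == "ovo":
        # One sub-problem per pair of classes, using only their samples.
        for a, b in combinations(classes, 2):
            mask = (labels == a) | (labels == b)
            yield mask, np.where(labels[mask] == a, 1.0, -1.0)
    else:
        raise ValueError("scheme must be 'ovr' or 'ovo'")


def rank_genes(X, labels, scheme="ovr"):
    """Return gene indices ordered from most to least informative.

    Assumption: scores from the sub-problems are merged with a maximum,
    so a gene that separates any single class (or pair) ranks highly.
    """
    scores = np.zeros(X.shape[1])
    for mask, y in two_class_problems(labels, scheme):
        w = pls_weights(X[mask], y)
        scores = np.maximum(scores, w ** 2)
    return np.argsort(-scores)
```

Calling rank_genes(X, labels, "ovo")[:k] would then give the k top-ranked gene indices under the OvO decomposition. Other aggregation rules (for example, averaging ranks across sub-problems) are equally plausible choices here.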

Results: This paper focuses on the effective identification of informative genes. As a result, a new strategy for finding a small subset of significant genes is designed. Our results on real multiclass cancer data show that our method achieves a very high accuracy rate for different combinations of classification methods while also giving very stable feature rankings.

Conclusions: This paper shows that the proposed strategies can substantially improve the performance of selected gene sets. OvR and OvO techniques applied to existing gene selection methods improve results as well. The presented method yields a more reliable classifier with lower classification error. At the same time, the method generates more stable ordered feature lists than existing methods.
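
To make the notion of ranking stability under the balanced bootstrap more concrete, here is a minimal Python sketch. It rests on assumptions: the balanced bootstrap is implemented in the usual way (every sample appears exactly the same number of times across all resamples), and stability is measured by a simple top-k overlap with the ranking on the original data, which is a stand-in and not the stability index used in the paper. The names balanced_bootstrap, between_class_ranking and top_k_overlap are hypothetical, and between_class_ranking is only a placeholder ranker.

```python
import numpy as np


def balanced_bootstrap(n_samples, n_resamples, rng):
    """Return an (n_resamples, n_samples) array of resampled indices.

    Each original index occurs exactly n_resamples times overall:
    the index list is repeated, shuffled once, and cut into rows.
    """
    pool = np.tile(np.arange(n_samples), n_resamples)
    rng.shuffle(pool)
    return pool.reshape(n_resamples, n_samples)


def between_class_ranking(X, labels):
    """Placeholder ranker: order genes by the variance of their class means."""
    classes = np.unique(labels)
    means = np.stack([X[labels == c].mean(axis=0) for c in classes])
    return np.argsort(-means.var(axis=0))


def top_k_overlap(rank_fn, X, labels, k, n_resamples=50, seed=0):
    """Mean fraction of the original top-k genes recovered per resample.

    A value of 1.0 means every bootstrap ranking reproduces the same
    top-k set as the ranking computed on the original data.
    """
    rng = np.random.default_rng(seed)
    reference = set(rank_fn(X, labels)[:k].tolist())
    overlaps = []
    for idx in balanced_bootstrap(len(labels), n_resamples, rng):
        boot_top = set(rank_fn(X[idx], labels[idx])[:k].tolist())
        overlaps.append(len(reference & boot_top) / k)
    return float(np.mean(overlaps))
```

Any ranking function with the signature rank_fn(X, labels) can be passed to top_k_overlap. The samples left out of each resample could also serve as a test set for estimating prediction error, which this sketch does not do.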

Figures

Figure 1
PLS based gene selection method with two-class decomposition technique.
Figure 2
Stability index s2 (bar chart) and accuracy of classification (dot chart) with the 95% confidence interval of the best classifier on the tested feature selection methods for LUNG data.
Figure 3
Accuracy of classification obtained by successive gene set reduction selected with all feature selection methods of the best classifier for LUNG data.
Figure 4
Results of bootstrap-based feature ranking (BBFR) for the first 50 genes for LUNG data. In the ideal case (when gene lists are perfectly reproducible) the BBFR score reaches a value of 1 for the first selected genes and 0 for the rest (black curve).
Figure 5
Comparison of rank boxplots in the bootstrap samples against rank in the original data set on all tested methods for LUNG data.
Figure 6
Stability index s2 (bar chart) and accuracy of classification (dot chart) with the 95% confidence interval of the best classifier on the tested feature selection methods for MLL data.
Figure 7
Stability index s2 (bar chart) and accuracy of classification (dot chart) with the 95% confidence interval of the best classifier on the tested feature selection methods for SRBCT data.
Figure 8
Accuracy of classification obtained by successive gene set reduction selected with all feature selection methods of the best classifier for MLL data.
Figure 9
Accuracy of classification obtained by successive gene set reduction selected with all feature selection methods of the best classifier for SRBCT data.
Figure 10
Results of bootstrap-based feature ranking (BBFR) for the first 50 genes for MLL data. In the ideal case (when gene lists are perfectly reproducible) the BBFR score reaches a value of 1 for the first selected genes and 0 for the rest (black curve).
Figure 11
Results of bootstrap-based feature ranking (BBFR) for the first 50 genes for SRBCT data. In the ideal case (when gene lists are perfectly reproducible) the BBFR score reaches a value of 1 for the first selected genes and 0 for the rest (black curve).
Figure 12
Comparison of rank boxplots in the bootstrap samples against rank in the original data set on all tested methods for MLL data.
Figure 13
Comparison of rank boxplots in the bootstrap samples against rank in the original data set on all tested methods for SRBCT data.
