BMC Genomics. 2008 Sep 16;9(Suppl 2):S24.
doi: 10.1186/1471-2164-9-S2-S24.

Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis

Guo-Zheng Li et al. BMC Genomics.

Abstract

Background: Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers. Dimension reduction methods fall into two types: feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods; in previous work, the top several components of PCA or PLS were selected for modeling according to the descending order of their eigenvalues. In this paper, we show that not all of the top features are useful, and that features should instead be selected from all the components by feature selection methods.
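To make the distinction in the Background concrete, the following is a minimal sketch of PCA-based feature extraction, assuming NumPy and a synthetic data matrix (the function name `pca_components`, the random data, and the index choices are illustrative and not taken from the paper). It computes all components ordered by descending eigenvalue, then contrasts the conventional "top k" choice with an arbitrary subset drawn from all components.

```python
import numpy as np

def pca_components(X):
    """Return PCA scores for all components and their eigenvalues.

    Components are ordered by descending eigenvalue, because the SVD
    returns singular values in descending order.
    """
    Xc = X - X.mean(axis=0)                      # center each gene (column)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s ** 2 / (X.shape[0] - 1)      # variance along each component
    scores = Xc @ Vt.T                           # samples projected onto components
    return scores, eigenvalues

# Synthetic stand-in for a microarray matrix: 20 samples, 100 genes (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))

scores, eigvals = pca_components(X)

top_k = scores[:, :5]            # conventional choice: first 5 components
subset = scores[:, [0, 3, 11]]   # alternative: any subset of all components
```

With fewer samples than genes, the number of extracted components is bounded by the number of samples, so searching over all of them remains a small problem.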

Results: We demonstrate a framework for selecting feature subsets from all the newly extracted components, leading to reduced classification error rates on gene expression microarray data. We consider an unsupervised method (PCA) and a supervised method (PLS) for extracting new components, genetic algorithms for feature selection, and support vector machines and k-nearest neighbor classifiers for classification. Experimental results illustrate that the proposed framework is effective at selecting feature subsets and reducing classification error rates.
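The pipeline described in the Results (extract components, select a subset with a genetic algorithm, classify) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the data are synthetic, the classifier is a 1-nearest-neighbor rule scored by leave-one-out accuracy, and the GA operators (binary tournament, uniform crossover, bit-flip mutation, elitism) and all parameter values are choices made here for brevity. The abstract does not specify any of them.

```python
import numpy as np

def loo_1nn_accuracy(Z, y):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on features Z."""
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)              # a sample may not be its own neighbor
    return float((y[d.argmin(axis=1)] == y).mean())

def fitness(mask, Z, y):
    """Fitness of a component subset; empty subsets score zero."""
    if not mask.any():
        return 0.0
    return loo_1nn_accuracy(Z[:, mask], y)

def ga_select(Z, y, pop_size=20, generations=15, p_mut=0.05, seed=0):
    """Search for a boolean mask over the columns of Z with a simple GA."""
    rng = np.random.default_rng(seed)
    n = Z.shape[1]
    pop = rng.random((pop_size, n)) < 0.5    # random initial masks
    history = []
    for _ in range(generations):
        fit = np.array([fitness(m, Z, y) for m in pop])
        best = pop[fit.argmax()].copy()
        history.append(fit.max())
        # Binary tournament selection.
        a = rng.integers(pop_size, size=pop_size)
        b = rng.integers(pop_size, size=pop_size)
        parents = np.where((fit[a] >= fit[b])[:, None], pop[a], pop[b])
        # Uniform crossover with randomly permuted partners.
        partners = parents[rng.permutation(pop_size)]
        cross = rng.random((pop_size, n)) < 0.5
        children = np.where(cross, parents, partners)
        # Bit-flip mutation.
        children ^= rng.random((pop_size, n)) < p_mut
        children[0] = best                   # elitism: keep the best mask
        pop = children
    fit = np.array([fitness(m, Z, y) for m in pop])
    history.append(fit.max())
    return pop[fit.argmax()], history

# Synthetic two-class data: 30 samples, 200 genes, 10 informative (assumption).
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 15)
X = rng.normal(size=(30, 200))
X[:, :10] += 1.5 * y[:, None]

# Unsupervised extraction: PCA scores for all components via SVD.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T

mask, history = ga_select(Z, y)
print("selected components:", np.flatnonzero(mask))
print("LOO accuracy, top 5 components:", loo_1nn_accuracy(Z[:, :5], y))
print("LOO accuracy, GA subset:", history[-1])
```

Because the GA is scored on the same samples it selects from, the reported accuracy is optimistic. A fair comparison would fit both the extraction and the selection on training data only and report error on held-out samples.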

Conclusion: The top features newly extracted by PCA or PLS are not the only important ones; feature selection should therefore be performed on the new features to select subsets that improve the generalization performance of classifiers.


Figures

Figure 1
Comparison of distributions of eigenvectors used by GAPCASVM and GAPLSSVM with C = 10, σ = 0.01 for SVM. The x-axis corresponds to the eigenvectors in descending order of their eigenvalues, divided into bins of size 5. The y-axis corresponds to the average number of times that the eigenvectors within a bin are selected by the GA.
Figure 2
Comparison of distributions of eigenvectors used by GAPPSVM with C = 10, σ = 0.01 for SVM. The x-axis corresponds to the eigenvectors in descending order of their eigenvalues, divided into bins of size 5. The y-axis corresponds to the average number of times that the eigenvectors within a bin are selected by the GA.
Figure 3
Comparison of distributions of eigenvectors used by GAPCAKNN and GAPLSKNN with k = 1 for kNN. The x-axis corresponds to the eigenvectors in descending order of their eigenvalues, divided into bins of size 5. The y-axis corresponds to the average number of times that the eigenvectors within a bin are selected by the GA.
Figure 4
Comparison of distributions of eigenvectors used by GAPPKNN with k = 1 for kNN. The x-axis corresponds to the eigenvectors in descending order of their eigenvalues, divided into bins of size 5. The y-axis corresponds to the average number of times that the eigenvectors within a bin are selected by the GA.
Figure 5
A framework of dimension reduction for the analysis of gene microarray data.
Figure 6
Genetic algorithm based feature selection.
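
The binning described in the captions of Figures 1 to 4 can be sketched as below. This assumes one reading of the captions: each GA run yields a 0/1 mask over the components (already ordered by descending eigenvalue), the masks are summed to count how often each component was selected, and the counts are averaged within consecutive bins of 5. The function name `binned_selection_frequency` and the example masks are made up for illustration.

```python
import numpy as np

def binned_selection_frequency(masks, bin_size=5):
    """Average selection count per bin of consecutive components.

    masks: array of shape (n_runs, n_components) with 0/1 entries, where
    columns are ordered by descending eigenvalue.
    """
    counts = np.asarray(masks, dtype=float).sum(axis=0)  # times each component was selected
    n_bins = -(-counts.size // bin_size)                 # ceiling division
    return np.array([
        counts[i * bin_size:(i + 1) * bin_size].mean()   # last bin may be shorter
        for i in range(n_bins)
    ])

# Three hypothetical GA runs over 12 components.
masks = np.array([
    [1, 1, 0, 0, 1,  0, 0, 1, 0, 0,  1, 0],
    [1, 0, 0, 0, 1,  0, 1, 1, 0, 0,  0, 0],
    [1, 1, 1, 0, 0,  0, 0, 0, 0, 1,  1, 1],
])

freq = binned_selection_frequency(masks)
print(freq)  # [1.6 0.8 1.5]
```

A flat or late-peaking profile in such a plot would indicate that the GA draws on components well beyond the leading ones, which is the point the figures are meant to illustrate.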
