BMC Bioinformatics. 2011 Jun 22;12:253.
doi: 10.1186/1471-2105-12-253.

Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems

Kim-Anh Lê Cao et al. BMC Bioinformatics. 2011.

Abstract

Background: Variable selection on high-throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), has become essential for extracting relevant information and, in turn, for better characterizing diseases or assessing genetic structure. There are several ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas machine learning wrapper approaches can be used for predictive purposes. When many variables are highly correlated, another option is to use multivariate exploratory approaches, which give more insight into cell biology, biological pathways or complex traits.

Results: A simple extension of a sparse PLS exploratory approach, sparse PLS discriminant analysis (sPLS-DA), is proposed to perform variable selection in a multiclass classification framework.

Conclusions: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.
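
To illustrate how a sparse PLS approach can select variables in a multiclass setting, here is a minimal sketch of one sparse loading vector. It assumes the usual formulation in which class labels are coded as a dummy (indicator) matrix and an L1-type soft-threshold is applied to the loading vector. It is not the mixOmics implementation: the function names and the `keep` parameter are hypothetical, and it covers the first dimension only, with no deflation.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of `a` toward zero by `lam`; small entries become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_plsda_component(X, labels, keep, n_iter=100, tol=1e-8):
    """Return one unit-norm sparse loading vector for X (n x p).

    `keep` (hypothetical name) is the maximum number of variables that may
    receive a nonzero weight.
    """
    classes = np.unique(labels)
    # Dummy matrix: one indicator column per class.
    Y = (labels[:, None] == classes[None, :]).astype(float)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    M = Xc.T @ Yc  # p x K cross-product matrix

    # Start from the leading singular vectors of M.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        a = M @ v
        # Threshold at the (keep+1)-th largest |a| so at most `keep` survive.
        ranked = np.sort(np.abs(a))[::-1]
        lam = ranked[keep] if keep < a.size else 0.0
        u_new = soft_threshold(a, lam)
        norm = np.linalg.norm(u_new)
        if norm == 0.0:
            break  # nothing survived the threshold; keep the previous u
        u_new /= norm
        v = M.T @ u_new
        v /= np.linalg.norm(v)
        converged = np.linalg.norm(u_new - u) < tol
        u = u_new
        if converged:
            break
    return u
```

The variables with nonzero entries in the returned vector are the ones selected on that dimension; the corresponding latent variable would be the centred data multiplied by this vector.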

Figures

Figure 1
Choosing the number of dimensions in sPLS-DA. Estimated classification error rates for Brain and SNP (10-fold cross-validation averaged over 10 repeats) with respect to each sPLS-DA dimension. The different lines represent the number of variables selected on each dimension (going from 5 to p).
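
The error estimate described in this caption (cross-validation repeated and averaged) can be sketched as follows. This is an illustrative sketch only: a nearest-centroid rule stands in for the actual classifier, the folds are not stratified, and all function and parameter names are assumptions made for the example.

```python
import numpy as np


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]


def repeated_cv_error(X, y, n_folds=10, n_repeats=10, seed=0):
    """Mean misclassification rate over `n_repeats` runs of `n_folds`-fold CV."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_repeats):
        # Fresh random partition of the samples for every repeat.
        folds = np.array_split(rng.permutation(n), n_folds)
        wrong = 0
        for test_idx in folds:
            train_mask = np.ones(n, dtype=bool)
            train_mask[test_idx] = False
            pred = nearest_centroid_predict(X[train_mask], y[train_mask], X[test_idx])
            wrong += int(np.sum(pred != y[test_idx]))
        errors.append(wrong / n)
    return float(np.mean(errors))
```

In a real analysis, any variable selection step would need to be refit on each training fold, so that the held-out samples play no part in choosing the variables.
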
Figure 2
Comparisons of the classification performance with other variable selection approaches. Estimated classification error rates for Leukemia, SRBCT, Brain, GCM and the SNP data set (10-fold cross-validation averaged over 10 repeats) with respect to the number of selected genes (from 5 to p) for the wrapper approaches and the sparse exploratory approaches.
Figure 3
Stability analysis. Stability frequency using bolasso for the first two dimensions of sPLS-DA for GCM (top) and SNP data (bottom). The most stable genes/SNPs must be chosen on the first dimension before the stability analysis can proceed to the next sPLS-DA dimension.
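
The idea of a bootstrap-based stability frequency can be sketched as below: resample the data with replacement, rerun the selection, and record how often each variable is chosen. This is a rough sketch for a single dimension only. The selector here ranks variables by the spread of their class means and merely stands in for the sPLS-DA selection step; all names are hypothetical.

```python
import numpy as np


def top_k_by_class_mean_spread(X, y, k):
    """Stand-in selector: indices of the k variables whose class means vary most."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    return np.argsort(means.var(axis=0))[::-1][:k]


def selection_frequency(X, y, select, n_boot=100, seed=0):
    """Fraction of bootstrap samples in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n samples with replacement
        counts[select(X[idx], y[idx])] += 1
    return counts / n_boot
```

Variables with a frequency close to 1 are selected in nearly every resample and can be regarded as stable under this scheme.
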
Figure 4
Brain data: sample representation and comparison with classical PLS-DA. Comparisons of the sample representation using the first 2 latent variables from PLS-DA (no variable selection) and sPLS-DA (26 genes selected).
Figure 5
SNP data: sample representation with PCA. Sample representations using the first 5 principal components from PCA.
Figure 6
SNP data: sample representation with classical PLS-DA. Sample representation using the first 5 latent variables from PLS-DA (no SNPs selected).
Figure 7
SNP data: sample representation with sPLS-DA. Sample representation using the first 5 latent variables from sPLS-DA (1000 SNPs selected on each dimension).
Figure 8
Brain data: representation of the loading vectors. Absolute value of the weights in the loading vectors for each sPLS-DA dimension. Only the genes with nonzero weights are considered in the sPLS-DA analysis and are included in the gene selection.
Figure 9
Brain data: variable representation. (a) projection of the sPLS-DA selected variables on correlation circles with the R mixOmics package; (b) biological network generated with GeneGo using the same list of genes. Genes that are present in the network (b) are labelled in green, red and magenta in (a).
