Algorithms Mol Biol. 2011 Dec 5;6(1):27. doi: 10.1186/1748-7188-6-27.

A Partial Least Squares based algorithm for parsimonious variable selection


Tahir Mehmood et al. Algorithms Mol Biol.

Abstract

Background: In genomics, a commonly encountered problem is to extract a subset of variables from a large set of explanatory variables associated with one or several quantitative or qualitative response variables. An example is to identify associations between codon usage and phylogeny-based definitions of taxonomic groups at different taxonomic levels. Issues to be addressed for such problems include maximum understandability with the smallest number of selected variables, consistency of the selected variables, and variation of model performance on test data.

Results: We present an algorithm that balances the parsimony and the predictive performance of a model. The algorithm is based on variable selection using reduced-rank Partial Least Squares with a regularized elimination. Allowing a marginal decrease in model performance results in a substantial decrease in the number of selected variables, which significantly improves the understandability of the model. Within this approach we have tested and compared three criteria commonly used for variable selection in the Partial Least Squares modeling paradigm: loading weights, regression coefficients and variable importance on projections (VIP). The algorithm is applied to the problem of identifying codon variations that discriminate different bacterial taxa, which is of particular interest in classifying metagenomic samples. The results are compared with a classical forward selection algorithm, the widely used Lasso algorithm, and Soft-Threshold Partial Least Squares (ST-PLS) variable selection.

Conclusions: A regularized elimination algorithm based on Partial Least Squares produces models that are more understandable and more consistent, and reduces the classification error on test data, compared to standard approaches.


Figures

Figure 1
Flow chart. The flow chart illustrates the proposed algorithm for variable selection.
Figure 2
An overview of the testing/training procedure used in this study. The rectangles illustrate the predictor matrix. At level 1 we split the data into a test set and a training set (25/75), to be used by all four methods listed on the right; this was repeated 100 times. Inside our suggested method, the stepwise elimination, there are two levels of cross-validation: at level 2 a 10-fold cross-validation was used to optimize the selection parameters f and d, and at level 3 leave-one-out cross-validation was used to optimize the regularized CPPLS method.
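The three-level scheme in this caption can be sketched as a nested loop; the data, the repeat count, and scikit-learn's splitters are placeholders here, and the actual tuning of f, d and the CPPLS model is elided:

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
y = (X[:, 0] > 0).astype(int)

n_repeats = 3  # the paper repeats the outer split 100 times
for rep in range(n_repeats):
    # Level 1: 75/25 training/test split, shared by all compared methods
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=rep)
    # Level 2: 10-fold CV on the training data to tune selection parameters f and d
    for tr, val in KFold(n_splits=10, shuffle=True, random_state=rep).split(X_tr):
        # Level 3: leave-one-out CV inside each fold to tune the (C)PPLS model
        for in_idx, out_idx in LeaveOneOut().split(X_tr[tr]):
            pass  # fit a candidate model on X_tr[tr][in_idx], score on the held-out sample
```

The key design point is that the test set from level 1 is never seen by the inner loops, so the reported test performance is untouched by parameter tuning.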
Figure 3
A typical elimination. A typical elimination is shown, based on the data for the phylum Actinobacteria. Each dot in the figure indicates one iteration. The procedure starts on the left-hand side with the full model. After some iterations the performance (P), which reflects the percentage of correctly classified samples, has increased and reaches a maximum. Further elimination reduces performance, but only marginally. When elimination becomes too severe, performance drops substantially. Finally, the selected model is the smallest model whose performance is not significantly worse than the maximum.
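The final selection step in this caption — the smallest model not significantly worse than the best — can be sketched as a simple tolerance rule. The paper uses a significance test on cross-validated performance, so the fixed tolerance below is a simplification, and the trace values are invented for illustration:

```python
def select_parsimonious(sizes, perf, tol=0.02):
    """Pick the smallest model whose performance is within `tol` of the best.

    sizes: number of variables at each elimination step
    perf:  cross-validated proportion of correctly classified samples
    """
    best = max(perf)
    return min(s for s, p in zip(sizes, perf) if p >= best - tol)

# Example elimination trace: performance rises, plateaus, then collapses
sizes = [4160, 1000, 250, 60, 15, 4]
perf = [0.80, 0.86, 0.90, 0.89, 0.89, 0.62]
print(select_parsimonious(sizes, perf))  # 15: far smaller than the 250-variable optimum
```

This is exactly the trade-off the abstract describes: accepting a marginal performance loss buys a large reduction in model size.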
Figure 4
The distribution of selected variables. The distribution of the number of variables selected by the optimum model and the selected model is presented for loading weights, VIP and regression coefficients in the upper panels, while the lower panels display the same for Forward, Lasso and ST-PLS. The horizontal axes show the number of retained variables as a percentage of the full model (with 4160 variables). All results are based on 100 random samples from the full data set, where 75% of the objects are used as training data and 25% as test data in each sample.
Figure 5
Performance comparison. The left panels present the distribution of performance of the full model, the optimum model and the selected model on test and training data sets for loading weights, VIP and regression coefficients, while the right panels display the same for Forward, Lasso and ST-PLS. All results are based on 100 random samples from the full data set, where 75% of the objects are used as training data and 25% as test data in each sample.
Figure 6
Selectivity score. The selectivity score is sorted in descending order for each criterion (loading weights, regression coefficient significance and VIP) in the left panels, while the right panels display the same for Forward, Lasso and ST-PLS. Only the first 500 values (out of 4160) are shown.
