Variable selection and validation in multivariate modelling

Lin Shi¹,², Johan A Westerhuis³,⁴, Johan Rosén⁵, Rikard Landberg¹,², Carl Brunius²

Affiliations

¹ Department of Molecular Sciences, Swedish University of Agricultural Sciences, Uppsala SE-750 07, Sweden.
² Department of Biology and Biological Engineering, Food and Nutrition Science, Chalmers University of Technology, Gothenburg SE-412 96, Sweden.
³ Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam XH, The Netherlands.
⁴ Metabolomics Center, North-West University, X6001, Potchefstroom, South Africa.
⁵ Swedish National Food Agency, Uppsala, Sweden.

PMID: 30165467
PMCID: PMC6419897
DOI: 10.1093/bioinformatics/bty710

Bioinformatics. 2019 Mar 15;35(6):972-980.

Abstract

Motivation: Validation of variable selection and predictive performance is crucial in the construction of robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of results. Inappropriate variable selection instead leads to selection bias, thereby increasing the risk of model overfitting and false positive discoveries. Although several algorithms exist to identify a minimal set of the most informative variables (i.e. the minimal-optimal problem), few can select all variables related to the research question (i.e. the all-relevant problem). Robust algorithms that combine identification of both minimal-optimal and all-relevant variables with proper cross-validation are urgently needed.

Results: We developed the MUVR algorithm to improve predictive performance and minimize overfitting and false positives in multivariate analysis. In the MUVR algorithm, minimal variable selection is achieved by performing recursive variable elimination in a repeated double cross-validation (rdCV) procedure. The algorithm supports partial least squares and random forest modelling, and simultaneously identifies minimal-optimal and all-relevant variable sets for regression, classification and multilevel analyses. Using three authentic omics datasets, MUVR yielded parsimonious models with minimal overfitting and improved model performance compared with state-of-the-art rdCV. Moreover, MUVR showed advantages over other variable selection algorithms, i.e. Boruta and VSURF, including a scheme that performs variable selection and validation simultaneously, and wider applicability.

Availability and implementation: Algorithms, data, scripts and tutorial are open source and available as an R package ('MUVR') at https://gitlab.com/CarlBrunius/MUVR.git.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Working principle of MUVR. (A) Graphical representation of the MUVR algorithm. The original data are randomly subdivided into **OUTER** segments. For each outer segment, the remaining (**INNER**) data are used for training and tuning of model parameters, including recursive ranking and backward elimination of variables. Each outer segment is then predicted using an optimized consensus model trained on all inner observations, ensuring that the holdout test set is never used for training or tuning modelling parameters. The procedure is then repeated for improved modelling performance. (B) Pseudocode of the MUVR algorithm
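The loop structure described for Fig. 1A can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the MUVR R package or its pseudocode: ordinary least squares stands in for the PLS and random forest models that MUVR supports, and every function name, the `keep_frac` parameter, the importance measure and the simulated data are assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept (stand-in for PLS/RF); intercept comes first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]

def inner_elimination(X, y, n_inner, keep_frac, rng):
    """Recursive backward elimination tuned by inner cross-validation.

    Returns the variable subset with the lowest inner prediction error.
    """
    variables = np.arange(X.shape[1])
    folds = np.array_split(rng.permutation(len(y)), n_inner)
    best_vars, best_err = variables, np.inf
    while True:
        sq_err, importance = 0.0, np.zeros(len(variables))
        for val in folds:
            train = np.setdiff1d(np.arange(len(y)), val)
            X_train = X[np.ix_(train, variables)]
            coef = fit_ols(X_train, y[train])
            resid = predict_ols(coef, X[np.ix_(val, variables)]) - y[val]
            sq_err += np.sum(resid ** 2)
            # Hypothetical importance measure: scaled absolute coefficient.
            importance += np.abs(coef[1:]) * X_train.std(axis=0)
        if sq_err <= best_err:  # on ties, prefer the smaller (later) subset
            best_vars, best_err = variables, sq_err
        if len(variables) == 1:
            break
        n_keep = max(1, int(len(variables) * keep_frac))
        variables = variables[np.argsort(importance)[::-1][:n_keep]]
    return best_vars

def rdcv_select(X, y, n_rep=3, n_outer=4, n_inner=3, keep_frac=0.75, seed=0):
    """Repeated double cross-validation with variable elimination in the inner loop."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((n_rep, n))
    counts = np.zeros(X.shape[1])
    for r in range(n_rep):
        for test in np.array_split(rng.permutation(n), n_outer):
            inner = np.setdiff1d(np.arange(n), test)
            # Selection and tuning only ever see the inner observations.
            sel = inner_elimination(X[inner], y[inner], n_inner, keep_frac, rng)
            # The outer segment is predicted by a model trained on all inner data.
            coef = fit_ols(X[np.ix_(inner, sel)], y[inner])
            preds[r, test] = predict_ols(coef, X[np.ix_(test, sel)])
            counts[sel] += 1
    y_hat = preds.mean(axis=0)  # average holdout predictions over repetitions
    q2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return q2, counts / (n_rep * n_outer)

# Simulated data: only the first two of ten variables carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=60)
q2, freq = rdcv_select(X, y)
print(f"Q2 = {q2:.2f}")
print("selection frequency per variable:", np.round(freq, 2))
```

The point of the nesting is that each outer segment is held out before any ranking or elimination takes place, so the reported Q² is not inflated by selection bias. The selection frequencies give a rough indication of how stable each variable's inclusion is across outer segments and repetitions.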

**Fig. 2.**
MUVR validation plots for identification of the all-relevant (‘max’ model) and minimal-optimal (‘min’ model) variables on three datasets: (A) ‘**Freelive**’, regression; (B) ‘**Mosquito**’, classification; (C) ‘**Crisp**’, multilevel. Results are presented for PLS (left) and random forest (right). Validation plots can be generated using the MUVR ‘*plotVAL*’ function

**Fig. 3.**
Flowchart of the permutation-by-class approach and the reclassification of variables from the MUVR-PLS classification on ‘Mosquito’ data using the permutation-by-class approach. The ‘Optimal’ variable set is selected in the MUVR ‘min’ model. The ‘Redundant’ variable set contains variables that belong to the all-relevant variable set selected in the MUVR ‘max’ model but not to the minimal-optimal variable set. The ‘Noisy’ variable set contains presumably uninformative variables that are not selected in the MUVR ‘max’ model. The permuted variable refers to the distinct variable class after permutation. Details are given in 2.2.4 Evaluation of stability of variable selection using MUVR

**Fig. 4.**
Performance of MUVR or repeated double cross-validation (rdCV) models built from actual data and random permutations for three datasets: (A) ‘**Freelive**’, regression; (B) ‘**Mosquito**’, classification; (C) ‘**Crisp**’, multilevel. The performance distributions of random permutations are represented as violin plots, with the asterisks representing actual model performance (Q² for regression, number of misclassifications for classification and multilevel analysis)
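The comparison between actual and permuted performance shown in Fig. 4 can be illustrated with a small permutation test. The sketch below is written in Python with NumPy under stated assumptions: a leave-one-out nearest-centroid classifier stands in for the MUVR and rdCV models, the data are simulated, and the empirical p-value is one common way to summarize such a test rather than necessarily the calculation used by the authors. All names are hypothetical.

```python
import numpy as np

def loo_misclassifications(X, labels):
    """Leave-one-out nearest-centroid classification; returns the number of errors."""
    classes = np.unique(labels)
    errors = 0
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i  # hold out observation i
        centroids = np.array(
            [X[keep & (labels == c)].mean(axis=0) for c in classes]
        )
        nearest = classes[np.argmin(np.linalg.norm(centroids - X[i], axis=1))]
        errors += int(nearest != labels[i])
    return errors

def permutation_test(X, labels, n_perm=200, seed=0):
    """Compare actual performance with performance on randomly permuted labels."""
    rng = np.random.default_rng(seed)
    actual = loo_misclassifications(X, labels)
    null = np.array(
        [loo_misclassifications(X, rng.permutation(labels)) for _ in range(n_perm)]
    )
    # Empirical p-value: share of permutations performing at least as well.
    p_value = (1 + np.sum(null <= actual)) / (1 + n_perm)
    return actual, null, p_value

# Simulated two-class data: two of five variables separate the classes.
rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5))
X[labels == 1, :2] += 3.0

actual, null, p_value = permutation_test(X, labels)
print("misclassifications (actual):", actual)
print("misclassifications (permuted, median):", np.median(null))
print("empirical p-value:", round(float(p_value), 4))
```

If the model has captured real structure, the actual number of misclassifications falls well below the bulk of the permutation distribution, which is what the asterisks and violin plots in the figure are meant to convey.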
