Front Neuroinform. 2023 Sep 26;17:1266713. doi: 10.3389/fninf.2023.1266713. eCollection 2023.

oFVSD: a Python package of optimized forward variable selection decoder for high-dimensional neuroimaging data

Tung Dang et al. Front Neuroinform. 2023.

Abstract

The complexity and high dimensionality of neuroimaging data pose problems for decoding information with machine learning (ML) models, because the number of features is often much larger than the number of observations. Feature selection is one of the crucial steps for determining meaningful target features in decoding; however, optimizing feature selection from such high-dimensional neuroimaging data has been challenging with conventional ML models. Here, we introduce an efficient and high-performance decoding package that couples a forward variable selection (FVS) algorithm with hyperparameter optimization and automatically identifies the best feature subsets for both classification and regression models; a total of 18 ML models are implemented by default. First, the FVS algorithm evaluates the goodness-of-fit across different models using k-fold cross-validation and identifies the best subset of features for each model based on a predefined criterion. Next, the hyperparameters of each ML model are optimized at each forward iteration. The final outputs report the optimized number of selected features (brain regions of interest) for each model together with its accuracy. Furthermore, the toolbox can be executed in a parallel environment for efficient computation on a typical personal computer. With the optimized forward variable selection decoder (oFVSD) pipeline, we verified the effectiveness of decoding sex classification and age-range regression on 1,113 structural magnetic resonance imaging (MRI) datasets. Compared to ML models without the FVS algorithm and to models using the Boruta algorithm as a variable selection counterpart, oFVSD significantly outperformed across all ML models: approximately a 0.20 increase in the correlation coefficient (r) for regression models and an 8% increase in classification performance on average relative to the models without FVS, and approximately a 0.07 improvement in regression and 4% in classification models relative to the Boruta algorithm. Furthermore, we confirmed that parallel computation considerably reduced the computational burden for the high-dimensional MRI data. Altogether, the oFVSD toolbox efficiently and effectively improves the performance of both classification and regression ML models, providing a use-case example on MRI datasets. With its flexibility, oFVSD has the potential to serve many other neuroimaging modalities. This open-source and freely available Python package makes it a valuable toolbox for research communities seeking improved decoding accuracy.
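As a rough illustration of the pipeline described in the abstract, the sketch below implements a greedy forward variable selection loop with k-fold cross-validation using plain scikit-learn. The function name, the choice of a random forest regressor, and the mean-squared-error criterion are illustrative assumptions and do not reflect the oFVSD package's actual API.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, cross_val_score

    def forward_variable_selection(X, y, model, n_select=10, k=5):
        """Greedily add the feature (ROI) that most improves the k-fold CV score."""
        remaining = list(range(X.shape[1]))
        selected, history = [], []
        cv = KFold(n_splits=k, shuffle=True, random_state=0)
        for _ in range(n_select):
            # Score every candidate feature together with the already-selected set.
            scores = [
                (cross_val_score(model, X[:, selected + [f]], y, cv=cv,
                                 scoring="neg_mean_squared_error").mean(), f)
                for f in remaining
            ]
            best_score, best_f = max(scores)
            selected.append(best_f)
            remaining.remove(best_f)
            history.append((best_f, best_score))
        return selected, history

    # Toy data standing in for ROI-wise gray matter volumes (n samples x m ROIs).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))
    y = 2 * X[:, 3] + rng.normal(size=100)
    selected, history = forward_variable_selection(
        X, y, RandomForestRegressor(n_estimators=50, random_state=0), n_select=5)
    print(selected)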

Keywords: MRI; VBM (voxel-based morphometry); forward variable selection; machine learning; neural decoding; optimized hyperparameter.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Workflow schematics of the automatic ML toolbox coupled with the forward variable selection (FVS) algorithm. All features, i.e., the gray matter volume data from each region of interest (ROI), undergo the FVS step, followed by either regression-based or classification-based ML with k-fold cross-validation (CV). Random search and grid search strategies with cross-validation are adopted to optimize the hyperparameters of the ML models at each iteration of the FVS algorithm. The final outcomes are evaluated with the mean squared error (MSE) and mean absolute error (MAE) for regression-based models and with the area under the curve (AUC) and confusion matrix for classification-based models. n is the number of samples, m is the total number of ROIs (246 ROIs in this study), and x is the number of ROIs that the user wants to select.
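The hyperparameter tuning step in this workflow (random or grid search with cross-validation at each FVS iteration) can be pictured with a minimal scikit-learn example. The estimator, parameter ranges, and scoring below are placeholder assumptions rather than the toolbox's defaults, and n_jobs=-1 stands in for the parallel execution mentioned in the text.

    import numpy as np
    from scipy.stats import randint
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, RandomizedSearchCV

    # Features for the ROI subset selected so far at one FVS iteration (toy data).
    rng = np.random.default_rng(1)
    X_subset = rng.normal(size=(100, 6))
    y = rng.normal(size=100)

    param_dist = {"n_estimators": randint(50, 300), "max_depth": randint(2, 12)}
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=0),
        param_distributions=param_dist,
        n_iter=20,
        cv=KFold(n_splits=5, shuffle=True, random_state=0),
        scoring="neg_mean_absolute_error",   # MAE criterion for regression
        n_jobs=-1,                           # parallel CV fits
        random_state=0,
    )
    search.fit(X_subset, y)
    print(search.best_params_, search.best_score_)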
Figure 2
Performance comparison of 11 regression models predicting age without feature selection, with the Boruta algorithm, and with the forward variable selection (FVS) algorithm, controlling for total intracranial volume (TIV). Left (blue): 11 regression models without the FVS algorithm. Middle (orange): 11 regression models on a subset of brain regions selected with the Boruta algorithm. Right (green): 11 regression models on a subset of brain regions selected with the FVS algorithm. P values were calculated using one-way repeated-measures ANOVA tests with Benjamini–Hochberg correction for multiple comparisons over the 11 pairs of models. *p < 0.05, **p < 0.01, ***p < 0.001.
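For orientation, the statistics reported in this figure can be outlined with statsmodels: a one-way repeated-measures ANOVA per model, followed by Benjamini–Hochberg correction across the resulting p values. The data below are random placeholders, and treating repeated cross-validation runs as the repeated-measures "subject" is an assumption made for illustration, not a detail taken from the paper.

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(2)
    p_values = []
    for model_name in [f"model_{i}" for i in range(11)]:   # the 11 regression models
        scores = pd.DataFrame({
            "subject": np.repeat(np.arange(10), 3),         # e.g., repeated CV runs
            "pipeline": np.tile(["noFVS", "Boruta", "FVS"], 10),
            "score": rng.normal(size=30),
        })
        res = AnovaRM(scores, depvar="score", subject="subject",
                      within=["pipeline"]).fit()
        p_values.append(res.anova_table["Pr > F"].iloc[0])

    # Benjamini-Hochberg (FDR) correction across the 11 comparisons.
    reject, p_adjusted, _, _ = multipletests(p_values, method="fdr_bh")
    print(p_adjusted)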
Figure 3
Performance comparison of LassoLars regression predicting age without feature selection, with the Boruta algorithm, and with the forward variable selection (FVS) algorithm, controlling for the effects of total intracranial volume (TIV). Left panel: LassoLars regression with all brain regions (MSE = 0.45, Spearman ρ = 0.44, p = 0.064). Middle panel: LassoLars regression on a subset of brain regions selected with the Boruta algorithm (MSE = 0.4, Spearman ρ = 0.51, p < 0.0001). Right panel: LassoLars regression on a subset of brain regions selected with the FVS algorithm (MSE = 0.36, Spearman ρ = 0.63, p < 0.0001). Predicted age is plotted as a function of the true score. The blue lines and shaded bands represent linear regression fits with confidence intervals.
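A minimal sketch of this kind of evaluation, assuming scikit-learn's LassoLars estimator, a simple train/test split, and synthetic data in place of the ROI volumes; the alpha value is arbitrary.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.linear_model import LassoLars
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Synthetic features and a continuous target standing in for age.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 30))
    y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=200)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LassoLars(alpha=0.01).fit(X_tr, y_tr)
    y_pred = model.predict(X_te)

    mse = mean_squared_error(y_te, y_pred)
    rho, p = spearmanr(y_te, y_pred)
    print(f"MSE = {mse:.2f}, Spearman rho = {rho:.2f}, p = {p:.3g}")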
Figure 4
Selected brain regions significantly associated with age. The red color denotes a positive correlation with age; the green color denotes a negative correlation with age. (A) Premotor thalamus (left), (B) premotor thalamus (right), (C) sensory thalamus (left), (D) orbital gyrus lateral area 11, (E) orbital gyrus orbital area 12/47, (F) basal ganglia dorsolateral putamen.
Figure 5
Performance comparison of 7 classification models classifying male and female groups without feature selection, with the Boruta algorithm, and with the forward variable selection (FVS) algorithm, controlling for the effects of total intracranial volume (TIV). Left (blue): 7 classification models without the FVS algorithm. Middle (orange): 7 classification models on a subset of brain regions selected by the Boruta algorithm. Right (green): 7 classification models on a subset of brain regions selected by the FVS algorithm. P values were calculated using one-way repeated-measures ANOVA tests with Benjamini–Hochberg correction for multiple comparisons. *p < 0.05, **p < 0.01, ***p < 0.001.
Figure 6
Performance comparison of the random forest classifier classifying the two sex groups without feature selection, with the Boruta algorithm, and with the forward variable selection (FVS) algorithm, controlling for the effects of total intracranial volume (TIV). Left panel: random forest classifier with all brain regions. Middle panel: random forest classifier on a subset of brain regions selected by the Boruta algorithm. Right panel: random forest classifier on a subset of brain regions selected by the FVS algorithm.
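A minimal sketch of the classifier evaluation shown here, assuming a scikit-learn random forest scored with ROC-AUC and a confusion matrix on a held-out split; the synthetic features and binary labels stand in for the ROI volumes and sex labels.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic features and a binary label standing in for sex classification.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 20))
    y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    cm = confusion_matrix(y_te, clf.predict(X_te))
    print(f"AUC = {auc:.2f}")
    print(cm)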
Figure 7
Selected brain regions identified as predictors of the sex categories (male and female). Red denotes male-predicting volume > female-predicting volume; green denotes male-predicting volume < female-predicting volume. (A) Premotor thalamus, (B) inferior frontal gyrus dorsal area 44, (C) precuneus medial area 5 (PEm), (D) basal ganglia dorsolateral putamen, (E) basal ganglia nucleus accumbens, (F) lingual gyrus medioventral occipital caudal.

