Bioengineering (Basel). 2025 Aug 6;12(8):845.
doi: 10.3390/bioengineering12080845.

The Effect of Data Leakage and Feature Selection on Machine Learning Performance for Early Parkinson's Disease Detection


Jonathan Starcke et al. Bioengineering (Basel). 2025.

Abstract

If we do not urgently educate current and future medical professionals to critically evaluate and distinguish credible AI-assisted diagnostic tools from those whose performance is artificially inflated by data leakage or improper validation, we risk undermining clinician trust in all AI diagnostics and jeopardizing future advances in patient care. For instance, machine learning models have shown high accuracy in diagnosing Parkinson's Disease when trained on clinical features that are themselves diagnostic, such as tremor and rigidity. This study systematically investigates the impact of data leakage and feature selection on the true clinical utility of machine learning models for early Parkinson's Disease detection. We constructed two experimental pipelines: one excluding all overt motor symptoms to simulate a subclinical scenario and a control including these features. Nine machine learning algorithms were evaluated using a robust three-way data split and comprehensive metric analysis. Results reveal that, without overt features, all models exhibited superficially acceptable F1 scores but failed catastrophically in specificity, misclassifying most healthy controls as Parkinson's Disease. The inclusion of overt features dramatically improved performance, confirming that high accuracy was due to data leakage rather than genuine predictive power. These findings underscore the necessity of rigorous experimental design, transparent reporting, and critical evaluation of machine learning models in clinically realistic settings. Our work highlights the risks of overestimating model utility due to data leakage and provides guidance for developing robust, clinically meaningful machine learning tools for early disease detection.
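The leakage effect the abstract describes can be reproduced in miniature. The sketch below, on synthetic data (not the authors' dataset), applies a three-way train/validation/test split and compares a feature set containing a near-diagnostic "overt" feature against one without it; model choice, signal strengths, and split ratios here are illustrative assumptions only.

```python
# Minimal sketch of the three-way split protocol, on synthetic data.
# "subtle" features carry a weak subclinical signal; "overt" is nearly
# the label itself, mimicking a leaky feature such as tremor severity.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)
subtle = rng.normal(size=(n, 5)) + 0.3 * y[:, None]  # weak pre-motor signal
overt = y[:, None] + 0.1 * rng.normal(size=(n, 1))   # near-diagnostic feature
X_leaky = np.hstack([subtle, overt])

def three_way_eval(X, y):
    # 60/20/20 split; validation is held out for tuning (unused in this
    # minimal sketch), and only the test split is reported.
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te))

f1_no_overt = three_way_eval(subtle, y)
f1_with_overt = three_way_eval(X_leaky, y)
print(f"F1 without overt feature: {f1_no_overt:.2f}")
print(f"F1 with overt feature:    {f1_with_overt:.2f}")
```

As in the study's control pipeline, the leaky feature set scores near-perfectly, while the subclinical set reflects the genuinely available signal.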
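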

Keywords: Parkinson’s Disease; clinical validation; data leakage; early diagnosis; feature selection; machine learning.


Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Illustration of the impact of feature selection on ML model accuracy for PD diagnosis. Left: Excluding overt motor symptoms simulates early detection, resulting in low accuracy. Right: Including overt symptoms (e.g., tremor, rigidity) yields high accuracy, but this is misleading due to data leakage: these features are themselves diagnostic and not available for true early detection. The temptation to include such features leads to artificially inflated performance metrics.
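The two feature sets contrasted in Figure 1 might be constructed as follows; the column names below are hypothetical, since the paper's exact schema is not listed here.

```python
# Building the control (leaky) and subclinical (early-detection) feature sets
# by dropping overt motor-symptom columns. Column names are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "tremor": [1, 0, 1], "rigidity": [1, 0, 0],  # overt, themselves diagnostic
    "voice_jitter": [0.02, 0.01, 0.03],          # subtle pre-motor candidates
    "gait_var": [0.4, 0.2, 0.5],
    "pd": [1, 0, 1],                             # label
})
OVERT = ["tremor", "rigidity"]

X_control = df.drop(columns=["pd"])              # control pipeline (leaky)
X_subclinical = X_control.drop(columns=OVERT)    # early-detection pipeline
print(list(X_subclinical.columns))
```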
Figure 2
Bar graph illustrating training, validation, and test accuracy for each ML model. High-capacity models such as random forest, DNN, and AdaBoost achieve near-perfect training accuracy but show marked drops on the test set, indicating overfitting. Simpler models like logistic regression maintain lower but more consistent performance across splits.
Figure 3
Bar graph of training, validation, and test F1 scores for each ML model. Large discrepancies between training and test F1 scores in complex models highlight generalization failure. More consistent but modest F1 scores are seen in logistic regression and LASSO.
Figure 4
Confusion matrices highlighting diagnostic behavior of LASSO (left) and DNN (right) on the test set, with overt features excluded. The LASSO model demonstrates extreme overprediction of PD, misclassifying 111 of 120 healthy controls as having PD. The DNN model, while less accurate overall, achieves the lowest false positive rate among all models. These results illustrate how confusion matrices can reveal pathological model behavior that is hidden by summary metrics.
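The caption's point, that summary metrics can hide pathological behavior, follows directly from the metric definitions. In the sketch below, the false-positive count (111 of 120 healthy controls misclassified, LASSO without overt features) is taken from Figure 4; the PD-class counts are hypothetical round numbers chosen only to show how F1 can remain acceptable while specificity collapses.

```python
# Specificity vs. F1 from a confusion matrix. Control counts are from
# Figure 4 (111 of 120 healthy controls misclassified as PD); the PD-class
# counts (tp, fn) are hypothetical, for illustration only.
def specificity(tn, fp):
    return tn / (tn + fp)

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

tn, fp = 9, 111    # reported: nearly all controls flagged as PD
tp, fn = 150, 10   # hypothetical PD counts

print(f"specificity = {specificity(tn, fp):.3f}")  # 0.075: controls almost never recognized
print(f"F1          = {f1(tp, fp, fn):.3f}")       # still looks superficially acceptable
```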
Figure 5
Confusion matrices for KNN (left) and random forest (right) on the test dataset when overt Parkinsonian motor features were included. The random forest model demonstrates superior specificity, with only 29 false positives, while KNN exhibits the highest false positive count (44) among the models analyzed.
Figure 6
Learning curves for (a) LASSO logistic regression trained without overt features and (b) KNN trained with overt features. Excluding overt features (a) results in modest, plateauing accuracy with a minimal gap between training and validation curves. Including overt features (b) yields substantially higher accuracy but a persistent gap between training and validation curves, indicating overfitting and an artificial performance boost from data leakage.
Figure 7
Learning curves for (a) random forest and (b) AdaBoost, both trained without overtly diagnostic features (i.e., features directly indicative of Parkinson’s Disease such as tremor or rigidity). Both models exhibit persistent overfitting: training accuracy remains near-perfect across all data sizes, while validation accuracy plateaus at substantially lower levels (approximately 0.5–0.6), with a large and stable gap between the two curves. This pattern indicates that, in the absence of strongly predictive features, high-capacity models such as random forest and AdaBoost are unable to generalize and instead memorize the training data.
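The memorization pattern described for the high-capacity models can be reproduced with scikit-learn's `learning_curve` utility on purely synthetic noise (not the study's data), where a random forest has nothing to generalize from and can only memorize, mirroring the no-overt-features setting.

```python
# Learning-curve diagnostic behind Figures 6-7: training vs. validation
# accuracy as training size grows. On pure noise, a random forest shows the
# overfitting signature: high training accuracy, flat low validation
# accuracy, and a large stable gap between the two.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))   # pure noise: no real signal to learn
y = rng.integers(0, 2, 400)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 4), cv=5, scoring="accuracy")

gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
print("mean train acc:", train_scores.mean(axis=1).round(2))
print("mean val acc:  ", val_scores.mean(axis=1).round(2))
print("train-val gap: ", gap.round(2))  # large, stable gap = memorization
```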
Figure 8
Direct comparison of test accuracy for nine ML models with and without overt features. Including overt features (orange) yields dramatically higher accuracy across all models, highlighting the risk of data leakage and the misleading nature of such results for early detection scenarios.

References

    1. Yousefi M., Akhbari M., Mohamadi Z., Karami S., Dasoomi H., Atabi A., Sarkeshikian S.A., Abdoullahi Dehaki M., Bayati H., Mashayekhi N., et al. Machine learning based algorithms for virtual early detection and screening of neurodegenerative and neurocognitive disorders: A systematic-review. Front. Neurol. 2024;15:1413071. doi: 10.3389/fneur.2024.1413071. - DOI - PMC - PubMed
    1. Martorell-Marugán J., Chierici M., Bandres-Ciga S., Jurman G., Carmona-Sáez P. Machine Learning Applications in the Study of Parkinson’s Disease: A Systematic Review. Curr. Bioinform. 2023;18:576–586. doi: 10.2174/1574893618666230406085947. - DOI
    1. Rabie H., Akhloufi M.A. A review of machine learning and deep learning for Parkinson’s disease detection. Discov. Artif. Intell. 2025;5:24. doi: 10.1007/s44163-025-00241-9. - DOI - PMC - PubMed
    1. Tabashum T., Snyder R.C., O’Brien M.K., Albert M.V. Machine Learning Models for Parkinson Disease: Systematic Review. JMIR Med. Inform. 2024;12:e50117. doi: 10.2196/50117. - DOI - PMC - PubMed
    1. Park H., Youm C., Cheon S.M., Kim B., Choi H., Hwang J., Kim M. Using machine learning to identify Parkinson’s disease severity subtypes with multimodal data. J. Neuroeng. Rehabil. 2025;22:126. doi: 10.1186/s12984-025-01648-2. - DOI - PMC - PubMed
