Lung cancer risk prediction using augmented machine learning pipelines with explainable AI
- PMID: 40969171
- PMCID: PMC12440954
- DOI: 10.3389/frai.2025.1602775
Abstract
Lung cancer remains the leading cause of cancer-related deaths worldwide, making early and precise diagnosis critical for improving patient survival rates. Machine learning has shown promising results in lung cancer prediction. However, class imbalance in clinical datasets degrades the performance of machine learning classifiers, leading to biased predictions and reduced accuracy. To address this issue, this study evaluates data augmentation techniques paired with machine learning classifiers on a small lung cancer dataset and compares the impact of each augmentation technique across classification models. Experimental findings demonstrate that K-Means SMOTE, combined with a Multi-Layer Perceptron classifier, achieves the highest accuracy of 93.55% and an AUC-ROC score of 96.76%, surpassing the other augmentation-classifier combinations. These results underscore the importance of selecting a suitable augmentation method to improve classification performance. Furthermore, to support model interpretability and transparency in medical decision-making, LIME is used to provide insights into model predictions. The study highlights the significance of advanced augmentation techniques in addressing data imbalance and thereby improving lung cancer risk prediction through machine learning. The findings contribute to the growing field of AI-driven healthcare by emphasizing the need to select effective augmentation-classifier pairs when developing more accurate and reliable diagnostic models. Due to the dataset's high cancer prevalence (87.45%) and limited size, this work is a preliminary methodological comparison, not a clinical tool. The findings emphasize the importance of augmentation for imbalanced data and lay the groundwork for future validation with larger, representative datasets.
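To illustrate the kind of oversampling the abstract refers to, the following is a minimal sketch of plain SMOTE-style interpolation in pure Python. It is not the authors' pipeline and it is not K-Means SMOTE, which additionally clusters the data before oversampling. The function name `smote_like_oversample`, its parameters, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbours.

    Hypothetical helper for illustration; not the method used in the study.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy, made-up data: 7 majority rows and 3 minority rows, two features each.
majority = [(float(i), float(i % 3)) for i in range(7)]
minority = [(10.0, 1.0), (11.0, 2.0), (12.0, 1.5)]

# Generate just enough synthetic minority rows to match the majority count.
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because each synthetic row is a convex combination of two real minority rows, it always lies between them in feature space. In a real experiment, oversampling of this kind would normally be applied to the training split only, so that evaluation data contains no synthetic rows.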
Keywords: SMOTE; class imbalance; explainable AI; LIME; lung cancer prediction.
Copyright © 2025 M S, D and Chakrabortty.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
