Scalable and robust machine learning framework for HIV classification using clinical and laboratory data
- PMID: 40436911
- PMCID: PMC12119985
- DOI: 10.1038/s41598-025-00085-4
Abstract
Human Immunodeficiency Virus (HIV) is a retrovirus that weakens the immune system, increasing vulnerability to infections and cancers. HIV spreads primarily through shared needles, from mother to child during childbirth or breastfeeding, and through unprotected sexual intercourse. Early diagnosis and treatment are therefore crucial to prevent progression from HIV to AIDS, which is associated with higher mortality. This study introduces a machine learning-based framework for classifying HIV infection, with the aim of limiting disease progression and transmission risk and improving long-term health outcomes. First, the challenge posed by an imbalanced dataset is addressed using the Synthetic Minority Over-sampling Technique (SMOTE), which was chosen over two alternative methods based on its superior performance. Additionally, we enhance dataset quality by removing outliers using the interquartile range (IQR) method. A two-step feature selection process reduces the 22 original features to 12 critical variables. We evaluate five machine learning models and identify the Random Forest Classifier (RFC) and Decision Tree Classifier (DTC) as the most effective, as they show higher classification performance than the other models. By integrating these two models into a voting classifier, we achieve an overall accuracy of 89%, a precision of 90.84%, a recall of 87.63%, and an F1-score of 89.21%. The model is validated on multiple external datasets with varying instance counts, reinforcing its robustness. Furthermore, an analysis using only CD4 and CD8 cell counts, which are essential laboratory measures for HIV monitoring, yields an accuracy of 87%, emphasizing the significance of these clinical features for the classification task.
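As an illustration of the IQR-based outlier removal mentioned above, the following is a minimal sketch using only the Python standard library. The abstract does not state the fence multiplier or the columns that were filtered, so the conventional factor of 1.5, the column name `cd4`, and the sample values are all assumptions made for this example, not details taken from the study.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences for the IQR rule.

    k=1.5 is the conventional multiplier and is an assumption here;
    the abstract does not say which value the authors used.
    """
    # Quartiles with linear interpolation over the observed range.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, column, k=1.5):
    """Keep only rows whose value in `column` lies within the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]


# Hypothetical CD4 counts (cells per microlitre); 2500 is an implausible entry.
rows = [{"cd4": v} for v in [350, 420, 390, 410, 380, 2500]]
cleaned = remove_outliers(rows, "cd4")
print(len(rows), "->", len(cleaned))
```

In a multi-feature dataset the same filter would typically be applied per numeric column, and it is usually fitted on the training split only so that the fences do not leak information from held-out data.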
These outcomes underscore the potential of combining machine learning techniques with critical clinical data to enhance the accuracy of HIV infection classification, ultimately contributing to improved patient management and treatment strategies. The findings also highlight the scalability of the approach: it can be adapted efficiently for large-scale use across healthcare environments, making it suitable for deployment in both high- and low-resource settings.
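For readers less familiar with the evaluation metrics reported in the abstract, the sketch below shows how accuracy, precision, recall, and F1-score are computed for a binary classifier from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not the study's data. Because F1 is the harmonic mean of precision and recall, it always lies between the two.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts.

    tp, fp, fn, tn are true positives, false positives, false negatives,
    and true negatives. Assumes at least one predicted positive and one
    actual positive, so no denominator is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical confusion counts, for illustration only.
metrics = classification_metrics(tp=85, fp=10, fn=15, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```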
© 2025. The Author(s).
Conflict of interest statement
Declarations. Competing interests: The authors declare no competing interests.