Scalable and robust machine learning framework for HIV classification using clinical and laboratory data

Sci Rep. 2025 May 28;15(1):18727.

doi: 10.1038/s41598-025-00085-4

Qian Sui¹, Gaoxu Li²,³, Yaqi Peng⁴, Jiasheng Zhang⁵, Yibo Zhang⁶,⁷, Riyang Zhao⁸

Affiliations

¹ The Fourth Hospital of Hebei Medical University, Shijiazhuang, China.
² Department of Mathematics, Xi'an Jiaotong-Liverpool University, Xi'an, China.
³ Gezhi Future Research Institute, No.1501 Building L, HaiDian District, Beijing, China.
⁴ The Second Hospital of Hebei Medical University, Hebei, China.
⁵ School of International Business, Anhui International Studies University, Wuhu, Anhui, China.
⁶ Gezhi Future Research Institute, No.1501 Building L, HaiDian District, Beijing, China. yibozh@pku.edu.cn.
⁷ School of Systems and Computing, UNSW Australia, UNSW Canberra, R118, B15, Canberra, ACT, 2600, Australia. yibozh@pku.edu.cn.
⁸ The Fourth Hospital of Hebei Medical University, Shijiazhuang, China. 49206307@hebmu.edu.cn.

PMID: 40436911
PMCID: PMC12119985
DOI: 10.1038/s41598-025-00085-4

Qian Sui et al. Sci Rep. 2025.


Abstract

Human Immunodeficiency Virus (HIV) is a retrovirus that weakens the immune system, increasing vulnerability to infections and cancers. HIV spreads primarily through shared needles, from mother to child during childbirth or breastfeeding, and through unprotected sexual intercourse. Early diagnosis and treatment are crucial to prevent progression from HIV to AIDS, which is associated with higher mortality. This study introduces a machine learning-based framework for classifying HIV infection, with the aim of limiting disease progression and transmission risk and improving long-term health outcomes. First, we address class imbalance in the dataset using the Synthetic Minority Over-sampling Technique (SMOTE), which was chosen over two alternative methods based on its superior performance. We then improve dataset quality by removing outliers using the interquartile range (IQR) method. A two-step feature selection process reduces the 22 original features to 12 critical variables. We evaluate five machine learning models and identify the Random Forest Classifier (RFC) and Decision Tree Classifier (DTC) as the most effective, as they show higher classification performance than the other models. By integrating these two models into a voting classifier, we achieve an overall accuracy of 89%, a precision of 90.84%, a recall of 87.63%, and an F1-score of 89.21%. The model is validated on multiple external datasets with varying instance counts, reinforcing its robustness. Furthermore, an analysis using only CD4 and CD8 cell counts, which are essential laboratory data for HIV monitoring, achieves an accuracy of 87%, emphasizing the significance of these clinical features for the classification task.
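The two preprocessing ideas named above, IQR-based outlier removal and SMOTE-style oversampling, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `iqr_filter` and `smote_like`, the 1.5 × IQR fence, the neighbour count `k`, and all sample values are assumptions chosen for illustration only.

```python
import math
import random
import statistics


def iqr_filter(rows, column, k=1.5):
    """Keep rows whose value in `column` lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [row[column] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [row for row in rows if low <= row[column] <= high]


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (Euclidean distance), which is the core idea of SMOTE."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical (CD4, CD8) counts; the last row has an extreme CD4 value.
records = [
    (400.0, 800.0),
    (410.0, 790.0),
    (420.0, 820.0),
    (430.0, 810.0),
    (5000.0, 805.0),
]
cleaned = iqr_filter(records, column=0)  # drops the extreme row

# Hypothetical minority-class points in a two-feature space.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
synthetic = smote_like(minority, n_new=6, k=2)
```

Because each synthetic point lies on a segment between two existing minority samples, oversampling of this kind never produces values outside the range already observed for the minority class.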
These outcomes underscore the potential of combining machine learning techniques with critical clinical data to enhance the accuracy of HIV infection classification, ultimately contributing to improved patient management and treatment strategies. The findings also highlight the scalability of the approach: it can be adapted efficiently for large-scale use across healthcare environments, making it suitable for deployment in both high- and low-resource settings.
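As a consistency check on the reported metrics, the F1-score can be recomputed from precision and recall, since it is their harmonic mean. The sketch below uses the precision (90.84%) and recall (87.63%) quoted in the abstract and assumes all three metrics refer to the same class and averaging scheme; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_from(0.9084, 0.8763)
print(f"F1 = {f1:.2%}")  # F1 = 89.21%
```

The harmonic mean always lies between its two inputs, so an F1-score can never exceed the larger of precision and recall.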

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
Illustration of the proposed methodology.

**Fig. 2**
Comparison of infected and non-infected sample counts before and after data augmentation.

**Fig. 3**
Distribution of features before outlier treatment.

**Fig. 4**
Distribution of features after outlier treatment.

**Fig. 6**
Feature ranking according to MAD analysis.

**Fig. 7**
Feature importance comparison for HIV classification using tree-based models and a linear model.

**Fig. 8**
The confusion matrices of RFC with 22 features and with 12 features respectively.

**Fig. 9**
The confusion matrices of DTC with 22 features and with 12 features respectively.

**Fig. 10**
The confusion matrices of LR with 22 features and with 12 features respectively.

**Fig. 11**
The confusion matrices of AB with 22 features and with 12 features respectively.

**Fig. 12**
The confusion matrices of KNN with 22 features and with 12 features respectively.

**Fig. 13**
Confusion matrix and ROC curve with 22 features.

**Fig. 14**
Confusion matrix and ROC curve with the selected 12 features.

**Fig. 15**
Performance comparison of the proposed voting classifier across different dataset sizes.

**Fig. 16**
Voting classifier performance—training and inference times vs. dataset size.

**Fig. 17**
Distribution of prediction errors for varying dataset sizes in the voting classifier.
