Sci Rep. 2025 May 28;15(1):18727.
doi: 10.1038/s41598-025-00085-4.

Scalable and robust machine learning framework for HIV classification using clinical and laboratory data

Qian Sui et al. Sci Rep.

Abstract

Human Immunodeficiency Virus (HIV) is a retrovirus that weakens the immune system, increasing vulnerability to infections and cancers. HIV spreads primarily through unprotected sexual intercourse, shared needles, and from mother to child during childbirth or breastfeeding. Early diagnosis and treatment are crucial to prevent progression from HIV to AIDS, which is associated with higher mortality. This study introduces a machine learning-based framework for classifying HIV infection, with the aim of supporting earlier intervention, reducing transmission risk, and improving long-term health outcomes. First, the challenges posed by an imbalanced dataset are addressed using the Synthetic Minority Over-sampling Technique (SMOTE), which was chosen over two alternative methods based on its superior performance. Additionally, we enhance dataset quality by removing outliers using the interquartile range (IQR) method. A two-step feature selection process reduces the 22 original features to 12 critical variables. We evaluate five machine learning models and identify the Random Forest Classifier (RFC) and Decision Tree Classifier (DTC) as the most effective, as they demonstrate higher classification performance than the other models. By integrating these two models into a voting classifier, we achieve an overall accuracy of 89%, a precision of 90.84%, a recall of 87.63%, and an F1-score of 89.21%. The model is validated on multiple external datasets with varying instance counts, reinforcing its robustness. Furthermore, an analysis restricted to CD4 and CD8 cell counts, which are essential laboratory measurements for HIV monitoring, achieves an accuracy of 87%, emphasizing the significance of these clinical features for the classification task.
These outcomes underscore the potential of combining machine learning techniques with critical clinical data to enhance the accuracy of HIV infection classification, ultimately contributing to improved patient management and treatment strategies. The findings also highlight the scalability of the approach, showing that it can be adapted efficiently for large-scale use across healthcare environments in both high- and low-resource settings.
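
Two of the steps named in the abstract, IQR-based outlier removal and the F1-score, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the 1.5 × IQR fence multiplier, the quartile interpolation method, the function names, and the toy `cd4` values are all assumptions chosen for illustration.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional multiplier; the abstract does not state
    which value the study used, so treat it as an assumption.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outlier_rows(rows, column, k=1.5):
    """Keep only rows whose value in `column` lies inside the IQR fences."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy records: the column name and values are made up.
rows = [{"cd4": v} for v in (310, 350, 400, 420, 450, 480, 500, 2900)]
cleaned = drop_outlier_rows(rows, "cd4")

# Precision and recall (in percent) as quoted in the abstract.
f1 = f1_score(90.84, 87.63)
```

Because F1 is a harmonic mean, it always lies between the precision and the recall, which makes it a quick consistency check on any reported metric triple.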

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Illustration of the proposed methodology.
Fig. 2
Comparison of infected and non-infected sample counts before and after data augmentation.
Fig. 3
Distribution of features before outlier treatment.
Fig. 4
Distribution of features after outlier treatment.
Fig. 5
Illustration of RFE ranking.
Fig. 6
Feature ranking according to MAD analysis.
Fig. 7
Feature importance comparison for HIV classification using tree-based models and a linear model.
Fig. 8
Confusion matrices of RFC with 22 features and with 12 features, respectively.
Fig. 9
Confusion matrices of DTC with 22 features and with 12 features, respectively.
Fig. 10
Confusion matrices of LR with 22 features and with 12 features, respectively.
Fig. 11
Confusion matrices of AB with 22 features and with 12 features, respectively.
Fig. 12
Confusion matrices of KNN with 22 features and with 12 features, respectively.
Fig. 13
Confusion matrix and ROC curve with 22 features.
Fig. 14
Confusion matrix and ROC curve with the selected 12 features.
Fig. 15
Performance comparison of the proposed voting classifier across different dataset sizes.
Fig. 16
Voting classifier performance: training and inference times vs. dataset size.
Fig. 17
Distribution of prediction errors for varying dataset sizes in the voting classifier.
