Comparison of machine learning techniques to handle imbalanced COVID-19 CBC datasets

PeerJ Comput Sci. 2021 Aug 12;7:e670. doi: 10.7717/peerj-cs.670. eCollection 2021.

Marcio Dorn¹,²,³, Bruno Iochins Grisci¹, Pedro Henrique Narloch¹, Bruno César Feltes¹,⁴, Eduardo Avila³,⁵, Alessandro Kahmann⁶, Clarice Sampaio Alho³,⁵

Affiliations

¹ Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil.
² Center of Biotechnology, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil.
³ Forensic Science, National Institute of Science and Technology, Porto Alegre, RS, Brazil.
⁴ Department of Genetics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil.
⁵ School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, RS, Brazil.
⁶ Institute of Mathematics, Statistics and Physics, Federal University of Rio Grande, Rio Grande, RS, Brazil.

PMID: 34458574
PMCID: PMC8372002
DOI: 10.7717/peerj-cs.670

Marcio Dorn et al. PeerJ Comput Sci. 2021.

Abstract

The coronavirus pandemic caused by the novel SARS-CoV-2 has significantly impacted human health and the economy, especially in countries with limited financial resources for medical testing and treatment, such as Brazil, the third most affected country. In this scenario, machine learning (ML) techniques have been heavily employed to analyze different types of medical data and aid decision making, offering a low-cost alternative. Given the urgency of fighting the pandemic, a large number of studies apply machine learning approaches to clinical data, including complete blood count (CBC) tests, which are among the most widely available medical tests. In this work, we review the machine learning classifiers most often employed for CBC data, together with popular sampling methods for dealing with class imbalance. Additionally, we describe and critically analyze three publicly available Brazilian COVID-19 CBC datasets and evaluate the performance of eight classifiers and five sampling techniques on them. Our work provides a panorama of which classifiers and sampling methods give the best results for different relevant metrics and discusses their impact on future analyses. The metrics and algorithms are introduced in a way that aids newcomers to the field. Finally, the panorama discussed here can significantly benefit the comparison of results from new ML algorithms.
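As a rough illustration of what a sampling method for class imbalance does, the sketch below implements plain random oversampling of the minority class using only the Python standard library. This is an assumption chosen for simplicity: the abstract does not name the five sampling techniques evaluated, so this is not necessarily one of them. The function name `random_oversample` and the toy values are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original rows belonging to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy training split (hypothetical values): 6 negative and 2 positive samples
train_rows = [[4.1], [5.0], [6.2], [5.5], [4.8], [6.0], [3.1], [2.9]]
train_labels = [0, 0, 0, 0, 0, 0, 1, 1]

bal_rows, bal_labels = random_oversample(train_rows, train_labels)
balanced_counts = Counter(bal_labels)
```

Resampling of this kind is normally applied to the training split only, so that the test split keeps the original class proportions and the reported metrics are not inflated by duplicated samples.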

Keywords: Covid; Data mining; Hemogram; Imbalanced datasets; Machine learning.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. Methodological steps used in this work.

Figure 2. Distributions of white blood cells related variables for positive (purple) and negative (green) classes of the three datasets: Albert Einstein Hospital (HAE), Fleury Group (FLE), and Sírio-Libanês Hospital (HSL). The central white dot is the median.

Figure 3. Distributions of red blood cells related variables for positive (purple) and negative (green) classes of the three datasets: Albert Einstein Hospital (HAE), Fleury Group (FLE), and Sírio-Libanês Hospital (HSL). The central white dot is the median.

Figure 4. Visualization of the negative (purple) and positive (green) samples from the Albert Einstein Hospital (AE), Fleury Laboratory (FLEURY) and Hospital Sirio Libanês (HSL) using t-SNE for all the different sampling schemes.

Figure 5. Average test results from 31 independent runs for several classifiers and sampling schemes trained on the Albert Einstein Hospital data. Black lines represent the standard deviation, while the white circle represents the median. (A) Sensitivity; (B) Specificity; (C) LR+; (D) LR−; (E) DOR; (F) F1 Score; (G) ROC-AUC Score.

Figure 6. Average test results from 31 independent runs for several classifiers and sampling schemes trained on the Fleury Group data. Black lines represent the standard deviation, while the white circle represents the median. (A) Sensitivity; (B) Specificity; (C) LR+; (D) LR−; (E) DOR; (F) F1 Score; (G) ROC-AUC Score.

Figure 7. Average test results from 31 independent runs for several classifiers and sampling schemes trained on the Sírio-Libanês Hospital data. Black lines represent the standard deviation, while the white circle represents the median. (A) Sensitivity; (B) Specificity; (C) LR+; (D) LR−; (E) DOR; (F) F1 Score; (G) ROC-AUC Score.
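Most of the metrics named in panels (A) to (F) of Figures 5 to 7 can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `diagnostic_metrics` and the example counts are hypothetical, and the code assumes no denominator is zero. ROC-AUC is omitted because it needs ranked classifier scores rather than a single confusion matrix.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute common diagnostic metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no empty class, no perfect cell).
    """
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)              # positive predictive value
    false_positive_rate = fp / (fp + tn)    # equals 1 - specificity
    false_negative_rate = fn / (fn + tp)    # equals 1 - sensitivity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": sensitivity / false_positive_rate,   # LR+
        "lr_neg": false_negative_rate / specificity,   # LR-
        "dor": (tp * tn) / (fp * fn),                  # equals LR+ / LR-
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts: 50 positive and 100 negative test samples
m = diagnostic_metrics(tp=40, fp=10, tn=90, fn=10)
```

A higher LR+ and a lower LR− both indicate a more informative test, and the DOR combines the two into a single ratio.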

Cited by

Comparative performance of twelve machine learning models in predicting COVID-19 mortality risk in children: a population-based retrospective cohort study in Brazil.
Lages Dos Santos A, Oliveira MCL, Colosimo EA, Mak RH, Pinhati CC, Gallante SC, Martelli-Júnior H, Simões E Silva AC, Oliveira EA. PeerJ Comput Sci. 2025 May 28;11:e2916. doi: 10.7717/peerj-cs.2916. eCollection 2025. PMID: 40567691.
Comparing machine learning algorithms to predict COVID-19 mortality using a dataset including chest computed tomography severity score data.
Zakariaee SS, Naderi N, Ebrahimi M, Kazemi-Arpanahi H. Sci Rep. 2023 Jul 13;13(1):11343. doi: 10.1038/s41598-023-38133-6. PMID: 37443373.
Machine learning approaches to predict the need for intensive care unit admission among Iranian COVID-19 patients based on ICD-10: A cross-sectional study.
Karimi Z, Malak JS, Aghakhani A, Najafi MS, Ariannejad H, Zeraati H, Yekaninejad MS. Health Sci Rep. 2024 Sep 2;7(9):e70041. doi: 10.1002/hsr2.70041. eCollection 2024 Sep. PMID: 39229475.
Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers.
C L, S P, Kashyap AH, Rahaman A, Niranjan S, Niranjan V. Cancer Inform. 2023 Apr 21;22:11769351231167992. doi: 10.1177/11769351231167992. eCollection 2023. PMID: 37113644.
COVID-19 health data prediction: a critical evaluation of CNN-based approaches.
Kim TH, Chinthaginjala R, Srinivasulu A, Tera SP, Rab SO. Sci Rep. 2025 Mar 17;15(1):9121. doi: 10.1038/s41598-025-92464-0. PMID: 40097568.
