J Med Internet Res. 2021 Apr 8;23(4):e27293.
doi: 10.2196/27293.

Classification Models for COVID-19 Test Prioritization in Brazil: Machine Learning Approach

Íris Viana Dos Santos Santana et al. J Med Internet Res.

Abstract

Background: Controlling the COVID-19 outbreak in Brazil is a challenge due to the population's size and urban density, inefficient maintenance of social distancing and testing strategies, and limited availability of testing resources.

Objective: This study aims to prioritize symptomatic patients for testing effectively, in order to assist early COVID-19 detection in Brazil and address problems related to inefficient testing and control strategies.

Methods: Raw data from 55,676 Brazilians were preprocessed, and the chi-square test was used to confirm the relevance of the following features: gender, health professional, fever, sore throat, dyspnea, olfactory disorders, cough, coryza, taste disorders, and headache. Classification models were implemented relying on preprocessed data sets; supervised learning; and the algorithms multilayer perceptron (MLP), gradient boosting machine (GBM), decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), k-nearest neighbors (KNN), support vector machine (SVM), and logistic regression (LR). The models' performances were analyzed using 10-fold cross-validation, classification metrics, and the Friedman and Nemenyi statistical tests. The permutation feature importance method was applied for ranking the features used by the classification models with the highest performances.
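As an illustration of the feature-relevance step described above, the following is a minimal sketch of a Pearson chi-square test of independence between one binary feature and the binary test result. It is not the authors' code: the function name `chi_square_2x2`, the toy `fever` and `positive` vectors, and the restriction to a 2x2 table (1 degree of freedom) are assumptions made for the example.

```python
import math

def chi_square_2x2(feature, label):
    """Pearson chi-square test of independence for two binary (0/1) sequences.

    Assumes every row and column of the 2x2 table is non-empty;
    otherwise an expected count is zero and the statistic is undefined.
    """
    # observed[f][y] counts samples with feature value f and label y
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    n = len(feature)
    row = [sum(observed[0]), sum(observed[1])]
    col = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical toy data: 40 positive and 40 negative patients
fever = [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30
positive = [1] * 40 + [0] * 40

chi2, p = chi_square_2x2(fever, positive)
print(chi2, p)
```

A small p-value suggests that the feature and the test result are not independent, which is the sense in which a feature would be considered relevant here.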

Results: Gender, fever, and dyspnea were among the highest-ranked features used by the classification models. In the comparative analysis, MLP, GBM, DT, RF, XGBoost, and SVM were the highest-performing models, with similar results. KNN and LR were outperformed by the other algorithms. With ease of interpretation as an additional comparison criterion, the DT was considered the most suitable model.
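The feature ranking mentioned above relies on permutation feature importance: a feature matters to a model to the extent that shuffling its column degrades the model's score. The sketch below is a generic version of that idea, not the study's implementation. The `predict` callable, the accuracy score, the `n_repeats` and `seed` parameters, and the toy data are all assumptions for illustration.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the true label."""
    hits = sum(1 for r, y in zip(rows, labels) if predict(r) == y)
    return hits / len(labels)

def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other columns intact
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: the label copies feature 0; feature 1 is irrelevant
rows = [[i % 2, (i // 2) % 2] for i in range(20)]
labels = [r[0] for r in rows]

def predict(row):
    # Stand-in for a fitted model; it reads feature 0 only
    return row[0]

importances = permutation_importance(predict, rows, labels)
print(importances)
```

In this toy setup the feature that the stand-in model ignores receives an importance of exactly zero, while the feature it depends on receives a positive value.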

Conclusions: The DT classification model can effectively assist COVID-19 test prioritization in Brazil (mean accuracy ≥89.12%). The model can be applied to recommend prioritizing symptomatic patients for COVID-19 testing.
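A mean accuracy obtained from 10-fold cross-validation is the average of the accuracies measured on each held-out fold. The sketch below shows that computation under simplifying assumptions: folds are contiguous and unshuffled (in practice data are usually shuffled or stratified first), and `fit_lookup` is a hypothetical placeholder learner, not the decision tree from the study.

```python
from collections import Counter, defaultdict

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def fit_lookup(train_rows, train_labels):
    """Placeholder learner: majority label per feature pattern seen in training."""
    overall = Counter(train_labels).most_common(1)[0][0]
    by_pattern = defaultdict(Counter)
    for r, y in zip(train_rows, train_labels):
        by_pattern[tuple(r)][y] += 1
    table = {p: c.most_common(1)[0][0] for p, c in by_pattern.items()}
    # Unseen patterns fall back to the overall majority label
    return lambda r: table.get(tuple(r), overall)

def cross_val_mean_accuracy(fit, rows, labels, k=10):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        predict = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(1 for i in test_idx if predict(rows[i]) == labels[i])
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores), scores

# Hypothetical toy data: one binary symptom that fully determines the label
rows = [[i % 2] for i in range(50)]
labels = [r[0] for r in rows]

mean_acc, fold_scores = cross_val_mean_accuracy(fit_lookup, rows, labels, k=10)
print(mean_acc, len(fold_scores))
```

Because every sample is held out exactly once, the averaged score estimates performance on unseen patients rather than on the training data.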

Keywords: COVID-19; classification models; medical diagnosis; test prioritization.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Overview of the research methodology applied for the study. The methodological steps consist of data preprocessing, the definition of new data sets, English translation, feature selection, 10-fold cross-validation, statistical comparisons, and feature ranking. AUPR: area under the precision-recall curve; AUROC: area under the receiver operating characteristic curve; DT: decision tree; GBM: gradient boosting machine; KNN: k-nearest neighbors; LR: logistic regression (weak regularization); LRR: logistic regression (strong regularization); MLP: multilayer perceptron; RF: random forest; RT-PCR: reverse transcription polymerase chain reaction; SVM: support vector machine; XGBoost: extreme gradient boosting.
Figure 2
Correlation matrix for (A) RT-PCR unbalanced data set, (B) RT-PCR balanced data set, (C) rapid unbalanced data set, (D) rapid balanced data set, (E) both unbalanced data set, and (F) both balanced data set. RT-PCR: reverse transcription polymerase chain reaction.
Figure 3
(A) Symptom frequencies for the 20,021 symptomatic patients in the "both" unbalanced data set, with the number of CCs. Top values are frequencies; numbers on the geometric forms are the CC for frequency. (B) Symptom frequencies for the 3128 symptomatic patients in the "both" balanced data set, with the number of CCs.
Figure 4
The models' ROC curves with (A) RT-PCR unbalanced, (B) RT-PCR balanced, (C) rapid unbalanced, (D) rapid balanced, (E) both unbalanced, and (F) both balanced. AUC: area under the receiver operating characteristic curve; GBM: gradient boosting machine; KNN: k-nearest neighbors; LR: logistic regression (weak regularization); LRR: logistic regression (strong regularization); Mlp: multilayer perceptron; ROC: receiver operating characteristic; RT-PCR: reverse transcription polymerase chain reaction; SVM: support vector machine; XGBoost: extreme gradient boosting.
Figure 5
Models' precision-recall curve with (A) RT-PCR unbalanced data set, (B) rapid unbalanced data set, and (C) both unbalanced data set. AP: average precision; GBM: gradient boosting machine; KNN: k-nearest neighbors; LR: logistic regression (weak regularization); LRR: logistic regression (strong regularization); Mlp: multilayer perceptron; PR: precision-recall; RT-PCR: reverse transcription polymerase chain reaction; SVM: support vector machine; XGBoost: extreme gradient boosting.
Figure 6
(A) The mean recall for the MLP, GBM, RF, DT, XGBoost, KNN, SVM, LRR, and LR classification models using the unbalanced data sets for RT-PCR, rapid, and both types. (B) The mean recall for the MLP, GBM, RF, DT, XGBoost, KNN, SVM, LRR, and LR classification models using the balanced data sets for RT-PCR, rapid, and both types. DT: decision tree; GBM: gradient boosting machine; KNN: k-nearest neighbors; LR: logistic regression (weak regularization); LRR: logistic regression (strong regularization); MLP: multilayer perceptron; RF: random forest; RT-PCR: reverse transcription polymerase chain reaction; SVM: support vector machine; XGBoost: extreme gradient boosting.
Figure 7
An application scenario connecting the decision tree classification model with a clinical workflow. The model guides test prioritization for symptomatic patients suspected of having COVID-19. RT-PCR: reverse transcription polymerase chain reaction.
