IEEE Access. 2021 Jan 11:9:10263-10281.
doi: 10.1109/ACCESS.2021.3050852. eCollection 2021.

A Novel Bayesian Optimization-Based Machine Learning Framework for COVID-19 Detection From Inpatient Facility Data

Md Abdul Awal et al. IEEE Access.

Abstract

The whole world faces a pandemic caused by the deadly COVID-19 virus. The infection takes considerable time to develop to the point where it can be detected, and during this period it may be transmitted to other people. Containing the spread therefore requires quick identification of COVID-19 patients. We have designed and optimized a machine learning-based framework using inpatient facility data that offers a user-friendly, cost-effective, and time-efficient solution. The proposed framework uses Bayesian optimization to tune the hyperparameters of the classifier and the ADAptive SYNthetic (ADASYN) algorithm to balance the COVID and non-COVID classes of the dataset. Although the proposed technique has been applied to nine state-of-the-art classifiers to show its efficacy, it can be applied to many other classifiers and classification problems. This study shows that eXtreme Gradient Boosting (XGB) provides the highest Kappa index, 97.00%. Compared with the same approach without ADASYN, our proposed approach yields an improvement in the Kappa index of 96.94%. In addition, Bayesian optimization has been compared with grid search and random search to demonstrate its efficiency. Furthermore, the most dominant features have been identified using SHapley Additive exPlanations (SHAP) analysis. A comparison with other related works has also been made. The proposed method can identify COVID patients in less time than conventional techniques. Finally, two potential applications, namely a clinically operable decision tree and a decision support system, have been demonstrated to support clinical staff and build a recommender system.

Keywords: ADASYN; Bayesian optimization; COVID-19; classification; inpatient's facility data.
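
The class-balancing step named in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation. It is a simplified, standard-library-only rendering of the general ADASYN idea for a two-class problem: minority samples that have more majority-class neighbours receive more synthetic samples, and each synthetic sample is an interpolation between two minority points. The function name `adasyn`, the parameters `k`, `beta`, and `seed`, and the toy data are all assumptions made for illustration. In practice a library implementation, such as the `ADASYN` class in imbalanced-learn, would normally be preferred.

```python
import math
import random


def adasyn(samples, labels, minority_label, k=5, beta=1.0, seed=0):
    """Return synthetic minority samples (simplified two-class ADASYN sketch).

    samples: list of equal-length numeric tuples; labels: parallel list.
    k, beta and seed are illustrative parameters, not values from the paper.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    n_minority = len(minority_idx)
    n_majority = len(labels) - n_minority
    # Total number of synthetic samples needed for the requested balance level.
    n_needed = int((n_majority - n_minority) * beta)
    if n_needed <= 0 or n_minority < 2:
        return []

    def nearest(i, pool):
        """Indices of the (up to) k points in pool closest to sample i, excluding i."""
        others = [j for j in pool if j != i]
        return sorted(others, key=lambda j: math.dist(samples[i], samples[j]))[:k]

    # For each minority point: fraction of majority-class points among its neighbours.
    ratios = []
    for i in minority_idx:
        neighbours = nearest(i, range(len(samples)))
        n_major_nb = sum(1 for j in neighbours if labels[j] != minority_label)
        ratios.append(n_major_nb / k)

    total = sum(ratios)
    if total == 0:
        return []  # no minority point borders the majority class

    synthetic = []
    for i, r in zip(minority_idx, ratios):
        # Harder-to-learn points (larger ratio) get a larger share of new samples.
        n_new = round(r / total * n_needed)
        minority_neighbours = nearest(i, minority_idx)
        for _ in range(n_new):
            j = rng.choice(minority_neighbours)
            lam = rng.random()
            # Interpolate between the minority point and one minority neighbour.
            synthetic.append(tuple(a + lam * (b - a)
                                   for a, b in zip(samples[i], samples[j])))
    return synthetic


# Toy data (hypothetical): 8 majority points and 3 minority points.
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (0, 2), (1, 2)]
minority = [(3, 3), (3, 4), (4, 3)]
samples = majority + minority
labels = [0] * len(majority) + [1] * len(minority)
new_points = adasyn(samples, labels, minority_label=1, k=3)
print(len(new_points), "synthetic minority samples")
```

Because the per-point counts are rounded, the number of generated samples can differ slightly from the exact class gap. Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies.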


Figures

FIGURE 1.
Characteristics of the Sample.
FIGURE 2.
Fill rate for all Variables.
FIGURE 3.
The overall workflow for the classification of COVID-19. In the first phase, raw data are collected and pre-processed: the data are imputed and scaled and, most importantly, the imbalanced classes are balanced using the ADASYN algorithm. Second, the pre-processed data are split into training and test sets, which the different classifiers use to measure classification performance. Next, Bayesian optimization is applied to tune the hyperparameters of the classifiers. The optimized classification model is then tested and evaluated using several performance measures (accuracy, precision, confusion matrix, ROC, 10-fold cross-validation, ANOVA, and a multi-comparison test). Finally, the important features are identified using SHAP analysis.
FIGURE 4.
ROC Curve with ADASYN.
FIGURE 5.
ROC curve without ADASYN. Note that the optimized model has not been created by using a balanced dataset.
FIGURE 6.
ROC Curve for COVID on original test data only using each model. The optimized model has been created by using a balanced dataset and then applied to the original test dataset.
FIGURE 7.
Confusion matrix of the balanced model applied to (a) the COVID test dataset with ADASYN and (b) the original COVID test dataset only. In Figure 7(a), the two diagonal cells give the number and percentage of correct classifications made by the trained network. The numbers of patients correctly classified as COVID and non-COVID were 3150 and 3233, corresponding to 48.7% and 49.9% of all patients, respectively. Likewise, the numbers of patients incorrectly classified as COVID and non-COVID were 24 and 67, corresponding to 0.4% and 1.0% of all patients. Overall, 99.2% of COVID cases were classified correctly and 0.8% incorrectly, while 98.0% of non-COVID cases were classified correctly and 2.0% incorrectly. In terms of predictions, 97.9% of COVID predictions and 99.3% of non-COVID predictions were correct, whereas 2.1% and 0.7% were incorrect. Figure 7(b) can be interpreted in the same way.
FIGURE 8.
Box plot for (a) the COVID dataset and (b) the multi-comparison test. Note that (b) is a graphical user interface tool with which one can test the statistical significance of any classifier. Here we show only the effect of XGB; the effect of the other classifiers can be interpreted in the same way.
FIGURE 9.
Recall rate vs. decision boundary curve for (a) COVID positive and (b) COVID negative.
FIGURE 10.
Bootstrap ROC curve of the COVID dataset using XGB with 95% CI.
FIGURE 11.
Feature importance plot using SHAP for XGB.
FIGURE 12.
The SHAP variable importance plot of training data using XGB.
FIGURE 13.
Comparative optimization techniques applied to the XGB model.
FIGURE 14.
Box-plot of Bayesian optimization and Harris Hawks optimization.
FIGURE 15.
A decision rule using four key features and their thresholds in absolute value.
FIGURE 16.
Probabilistic output for the DSS. In the upper figure, 0 represents a COVID-negative subject and 1 represents a COVID-positive subject. The lower figure shows the probabilistic outcome of a subject being affected by COVID, where the red dotted line marks the threshold level. When the probability for a patient exceeds this threshold, the subject is considered COVID positive, whereas a subject with a probability below the threshold value of 0.5 is regarded as COVID negative. In either case, the output can be read as the chance that a person is affected by COVID.
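
The percentages quoted in the Figure 7(a) caption can be reproduced from its four counts. The sketch below is a worked arithmetic check, not code from the paper. The placement of the counts in the 2×2 matrix is an assumption, because the caption does not state which axis holds the predicted classes. For that reason the two sets of marginal rates are labelled only as row-wise and column-wise. The helper names `summarize` and `cohen_kappa` are hypothetical. The kappa value is computed from these four counts alone and need not coincide with the Kappa index reported in the abstract, which comes from the paper's own evaluation.

```python
# Counts from the Figure 7(a) caption; the cell layout is an assumption.
matrix = [[3150, 24],
          [67, 3233]]


def summarize(m):
    """Return cell shares and diagonal rates (all in percent, one decimal)."""
    total = sum(sum(row) for row in m)
    cell_pct = [[round(100 * v / total, 1) for v in row] for row in m]
    # Share of each row that falls on the diagonal.
    row_pct = [round(100 * m[i][i] / sum(m[i]), 1) for i in range(2)]
    # Share of each column that falls on the diagonal.
    col_pct = [round(100 * m[i][i] / sum(r[i] for r in m), 1) for i in range(2)]
    return cell_pct, row_pct, col_pct


def cohen_kappa(m):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    total = sum(sum(row) for row in m)
    observed = sum(m[i][i] for i in range(len(m))) / total
    expected = sum(
        sum(m[i]) * sum(r[i] for r in m) for i in range(len(m))
    ) / total ** 2
    return (observed - expected) / (1 - expected)


cell_pct, row_pct, col_pct = summarize(matrix)
print(cell_pct)   # [[48.7, 0.4], [1.0, 49.9]]
print(row_pct)    # [99.2, 98.0]
print(col_pct)    # [97.9, 99.3]
print(round(cohen_kappa(matrix), 4))
```

The row-wise and column-wise rates match the two pairs of percentages in the caption (99.2% and 98.0%, then 97.9% and 99.3%). Which pair corresponds to recall and which to precision depends on the matrix orientation used in the figure.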
