IEEE Access. 2020 Oct 26:8:196299-196325.
doi: 10.1109/ACCESS.2020.3034032. eCollection 2020.

Explainable Machine Learning for Early Assessment of COVID-19 Risk Prediction in Emergency Departments

Elena Casiraghi et al. IEEE Access.

Abstract

Between January and October of 2020, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infected more than 34 million people in a worldwide pandemic, leading to over one million deaths (data from Johns Hopkins University). Since the virus began to spread, emergency departments have been busy with COVID-19 patients for whom a quick decision regarding in- or outpatient care was required. The virus can cause characteristic abnormalities in chest radiographs (CXR), but, due to the low sensitivity of CXR, additional variables and criteria are needed to accurately predict risk. Here, we describe a computerized system primarily aimed at extracting the most relevant radiological, clinical, and laboratory variables for improving patient risk prediction, and secondarily at presenting an explainable machine learning system, which may provide simple decision criteria to be used by clinicians as a support for assessing patient risk. To achieve robust and reliable variable selection, Boruta and Random Forest (RF) are combined in a 10-fold cross-validation scheme to produce a variable importance estimate not biased by the presence of surrogates. The most important variables are then selected to train an RF classifier, whose rules may be extracted, simplified, and pruned to finally build an associative tree, which is particularly appealing for its simplicity. Results show that the radiological score automatically computed through a neural network is highly correlated with the score computed by radiologists, and that laboratory variables, together with the number of comorbidities, aid risk prediction. The prediction performance of our approach was compared to that of generalized linear models and shown to be effective and robust. The proposed machine learning-based computational system can be easily deployed and used in emergency departments for rapid and accurate risk prediction in COVID-19 patients.
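The variable-selection idea in the abstract (Boruta combined with RF inside 10-fold cross-validation) can be illustrated with a rough sketch. This is not the authors' implementation: it is a simplified, Boruta-like shadow-feature test built on scikit-learn, and the function name `shadow_feature_hits`, the synthetic data, and the 8-of-10 selection threshold are all assumptions made for illustration. Real Boruta iterates and applies a statistical test to the hit counts, which is omitted here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def shadow_feature_hits(X, y, n_splits=10, seed=0):
    """Count, per feature, the folds in which its RF importance beats every shadow."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (train_idx, _) in enumerate(folds.split(X, y)):
        X_train, y_train = X[train_idx], y[train_idx]
        # Shadow features: each column shuffled independently, which breaks
        # any relation with the label but keeps the marginal distribution.
        shadows = np.column_stack(
            [rng.permutation(X_train[:, j]) for j in range(n_features)]
        )
        forest = RandomForestClassifier(n_estimators=100, random_state=seed + fold)
        forest.fit(np.hstack([X_train, shadows]), y_train)
        importances = forest.feature_importances_
        best_shadow = importances[n_features:].max()
        # A "hit" means the real feature outranks the best shadow in this fold.
        hits += importances[:n_features] > best_shadow
    return hits


# Synthetic toy data (not the study's data): only column 0 drives the label.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
hits = shadow_feature_hits(X_demo, y_demo)
selected = np.flatnonzero(hits >= 8)  # hypothetical 8-of-10 threshold
```

Counting hits across folds, rather than trusting a single fit, is what makes the selection less sensitive to one particular train/test split.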

Keywords: Associative tree; Boruta feature selection; COVID-19; clinical data analysis; generalized linear models; missing data imputation; random forest classifier; risk prediction.

Figures

FIGURE 1.
Histogram of missing values for each sample: the maximum number of missing values is 12, corresponding to 25% of the variables. Only one sample has 12 missing values.
FIGURE 2.
Missing data patterns. (left) Proportion of missing values for all variables in the dataset, sorted in decreasing order. (right) Combinations of missing values: a red square in a matrix entry denotes the presence of missing values for the variable associated with the column in the samples corresponding to the row; the bars on the right show the cardinality of each set of points.
FIGURE 3.
Between-imputation variances computed on 100 datasets imputed with distFree. Dots and triangles mark the variances computed using increasing and decreasing imputation order, respectively.
FIGURE 4.
Between-imputation variances computed on the 100 datasets imputed with micePMM (left) and miceRF (right), using the same scale for the Y axis. Same notations as in Fig. 3.
FIGURE 5.
Between-imputation variances computed on the 100 datasets imputed with missForest. The obtained values are negligible, as highlighted by the span of the Y axis: in practice, this means that the imputed values are always similar. Same notations as in Fig. 3.
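As a reading aid for Figs. 3 to 5, the between-imputation variance is conventionally the sample variance, across the m imputed datasets, of an estimate computed on each dataset. The sketch below assumes that estimate is one number per variable (for example, its mean); the captions do not say which quantity the authors pooled, so the function name and the toy numbers are illustrative only.

```python
import numpy as np


def between_imputation_variance(estimates):
    """Sample variance of per-dataset estimates, one value per variable.

    `estimates` has shape (m, p): m imputed datasets, p variables.
    """
    estimates = np.asarray(estimates, dtype=float)
    # ddof=1 gives the usual 1 / (m - 1) normalisation.
    return estimates.var(axis=0, ddof=1)


# Toy estimates from m = 3 imputed datasets and p = 2 variables (made-up numbers).
toy_estimates = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
variances = between_imputation_variance(toy_estimates)
```

A value near zero, as in the second toy column, means the imputation method fills in almost the same values every time.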
FIGURE 6.
An associative tree. The tree evaluates the conditions one after another until one is met, which leads to a decision.
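The evaluation order described in this caption amounts to an ordered rule list in which the first matching condition wins. The following minimal sketch shows that control flow only; the variable names, thresholds, and decision labels are hypothetical placeholders and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, float]
Rule = Tuple[Callable[[Patient], bool], str]


def evaluate(rules: List[Rule], default: str, patient: Patient) -> str:
    """Return the decision of the first rule whose condition holds."""
    for condition, decision in rules:
        if condition(patient):
            return decision
    # No condition was met: fall through to the default decision.
    return default


# Hypothetical rules; names and cut-offs are invented for illustration.
rules: List[Rule] = [
    (lambda p: p["radiological_score"] >= 6, "high risk"),
    (lambda p: p["n_comorbidities"] >= 3, "high risk"),
    (lambda p: p["radiological_score"] <= 1, "low risk"),
]

decision_a = evaluate(rules, "low risk", {"radiological_score": 7, "n_comorbidities": 0})
decision_b = evaluate(rules, "low risk", {"radiological_score": 3, "n_comorbidities": 4})
decision_c = evaluate(rules, "low risk", {"radiological_score": 3, "n_comorbidities": 1})
```

Because the rules are checked in sequence, their order matters: a patient who satisfies several conditions receives the decision attached to the earliest one.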
FIGURE 7.
Top: estimates (and standard errors) of the feature relevance computed by RFs. Bottom: estimates (and standard errors) of the feature coefficients computed by GLMs. Only the significant feature relevances/coefficients are plotted.
FIGURE 8.
Relevance/coefficient estimates computed by RFs and GLMs. Only the significant estimates are reported. For GLMs, red bars highlight negative coefficients (that is, variables inversely related to the risk).
FIGURE 9.
Global, significant estimates of pooled correlation coefficients between each feature and the label computed on the 50 sets imputed by missForest.
FIGURE 10.
Pooled significant estimates of feature importance computed by RFs on the 50 sets (imputed by missForest) when saturation variables are removed.
FIGURE 11.
Pooled pairwise (Pearson, Spearman, and Kendall’s) correlation coefficients between pairs of variables computed over the 50 datasets imputed by missForest.

