Comput Biol Med. 2023 Mar;154:106619.
doi: 10.1016/j.compbiomed.2023.106619. Epub 2023 Feb 1.

Explainable artificial intelligence model for identifying COVID-19 gene biomarkers

Fatma Hilal Yagin et al. Comput Biol Med. 2023 Mar.

Abstract

Aim: COVID-19 has revealed the need for fast and reliable methods to assist clinicians in diagnosing the disease. This article presents a model that applies explainable artificial intelligence (XAI) methods, built on machine learning techniques, to COVID-19 metagenomic next-generation sequencing (mNGS) samples.

Methods: The data set used in the study contains expression data for 15,979 genes from 234 patients, of whom 141 (60.3%) were COVID-19 negative and 93 (39.7%) were COVID-19 positive. The least absolute shrinkage and selection operator (LASSO) method was applied to select genes associated with COVID-19. The Support Vector Machine - Synthetic Minority Oversampling Technique (SVM-SMOTE) method was used to handle the class imbalance problem. Logistic regression (LR), SVM, random forest (RF), and extreme gradient boosting (XGBoost) models were constructed to predict COVID-19. An explainable approach based on local interpretable model-agnostic explanations (LIME) and SHapley Additive exPlanations (SHAP) was applied to identify COVID-19-associated biomarker candidate genes and to improve the final model's interpretability.
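To make the class-imbalance step concrete, the following is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the authors' implementation: it uses ordinary SMOTE interpolation between minority-class neighbours and omits the SVM-based selection of borderline samples that distinguishes SVM-SMOTE. The function name `smote_like_oversample`, the two-feature toy data, and the choice to balance the full data set to a 1:1 ratio are all assumptions made for illustration; only the class counts (141 and 93) come from the abstract.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Plain SMOTE-style interpolation; the SVM-based borderline selection of
    SVM-SMOTE is deliberately left out of this sketch.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Class counts reported in the abstract
n_negative, n_positive = 141, 93
n_needed = n_negative - n_positive  # synthetic positives needed for a 1:1 ratio

# Made-up expression values for two hypothetical genes (minority class only)
data_rng = random.Random(42)
positives = [
    (data_rng.gauss(5.0, 1.0), data_rng.gauss(2.0, 0.5))
    for _ in range(n_positive)
]

new_points = smote_like_oversample(positives, n_needed)
print(n_needed, len(positives) + len(new_points))  # prints: 48 141
```

Each synthetic point lies on the line segment between two real minority samples, so the oversampled class stays within the region already occupied by observed positives. In practice, oversampling is usually applied to the training split only, so that the test set contains no synthetic samples.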

Results: For the diagnosis of COVID-19, the XGBoost model (accuracy: 0.930) outperformed the RF (accuracy: 0.912), SVM (accuracy: 0.877), and LR (accuracy: 0.912) models. According to SHAP, the three most important genes associated with COVID-19 were IFI27, LGR6, and FAM83A. The LIME results showed that a high level of IFI27 gene expression in particular increased the predicted probability of the positive class.
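The idea behind a SHAP ranking can be illustrated with exact Shapley values on a toy function. This is a sketch under stated assumptions, not the study's analysis: the model `toy_model`, its coefficients, the feature names `gene_a`, `gene_b`, `gene_c`, and the all-zero baseline are hypothetical, and the exhaustive enumeration below is feasible only for a handful of features (SHAP libraries rely on faster, model-specific algorithms for tree ensembles such as XGBoost).

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take baseline values."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their observed value, the rest use baseline
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical risk score over three made-up gene-expression features
genes = ["gene_a", "gene_b", "gene_c"]


def toy_model(z):
    return 0.6 * z[0] + 0.3 * z[1] + 0.1 * z[0] * z[2]


x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Rank features by absolute contribution, as in a SHAP importance plot
for name, p in sorted(zip(genes, phi), key=lambda t: -abs(t[1])):
    print(f"{name}: {p:+.3f}")
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, which is the additivity property that lets per-feature SHAP values be read as shares of a single prediction. Averaging the absolute values over many patients gives a global ranking of the kind reported for IFI27, LGR6, and FAM83A.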

Conclusions: The proposed model (XGBoost) was able to predict COVID-19 successfully. The results show that machine learning combined with LIME and SHAP can explain the biomarker prediction for COVID-19 and give clinicians an intuitive, interpretable view of the impact of risk factors in the model.

Keywords: COVID-19; Explainable artificial intelligence; LIME; SHAP; XGBoost.

Conflict of interest statement

Declaration of competing interest: The authors declare that there is no conflict of interest.

Figures

Fig. 1. Diagram of the proposed method combining explainability and classifier.
Fig. 2. Plot of the results of the ML models for COVID-19 (95% confidence interval (CI)).
Fig. 3. Interpretation of the XGBoost model. (A): Ranking of the importance of the top 20 risk factors, with stability and interpretation, using the optimal model. (B): The order of importance of the first 20 variables according to mean |SHAP value|; the higher a feature's SHAP value, the higher the probability that the patient is COVID-19 positive.
Fig. 4. Variation of the expression levels of the five most important genes between groups.
Fig. 5. Radar plot of the five most important genes. The COVID-19 patient appears in orange and the control-group patient in blue.
Fig. 6. Local interpretable model-agnostic explanations.
