J Clin Med. 2021 Mar 20;10(6):1286.
doi: 10.3390/jcm10061286.

Prediction of Long-Term Stroke Recurrence Using Machine Learning Models


Vida Abedi et al. J Clin Med. 2021.

Abstract

Background: The long-term risk of recurrent ischemic stroke, estimated at between 17% and 30%, cannot be reliably assessed at the individual level. Our goal was to study whether machine learning models can be trained to predict stroke recurrence, to identify key clinical variables, and to assess whether performance metrics can be optimized.

Methods: We used patient-level data from electronic health records, six interpretable algorithms (Logistic Regression, Extreme Gradient Boosting, Gradient Boosting Machine, Random Forest, Support Vector Machine, Decision Tree), four feature selection strategies, five prediction windows, and two sampling strategies to develop 288 models for up to 5-year stroke recurrence prediction. We further identified important clinical features and different optimization strategies.

Results: We included 2091 ischemic stroke patients. The area under the receiver operating characteristic curve (AUROC) was stable across prediction windows of 1, 2, 3, 4, and 5 years, with the highest score for the 1-year window (0.79) and the lowest for the 5-year window (0.69). A total of 21 (7%) models reached an AUROC above 0.73, while 110 (38%) models reached an AUROC greater than 0.7. Among the 53 features analyzed, age, body mass index, and laboratory-based features (such as high-density lipoprotein, hemoglobin A1c, and creatinine) had the highest overall importance scores. The balance between specificity and sensitivity improved through sampling strategies.
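
As an illustration of the metrics reported in the Results, the following is a minimal sketch of how AUROC, sensitivity, and specificity can be computed from predicted risk scores. The labels and scores are made-up toy values, not study data, and the function names are hypothetical; the abstract does not state which software was used to compute these metrics.

```python
def auroc(labels, scores):
    """Rank-based AUROC: the probability that a randomly chosen positive
    (recurrence) case scores higher than a randomly chosen negative case.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Classify score >= threshold as predicted recurrence and return
    (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example (illustrative values only): 1 = recurrence, 0 = no recurrence.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.7, 0.1, 0.3]

toy_auroc = auroc(labels, scores)
sens_default, spec_default = sensitivity_specificity(labels, scores, 0.5)
sens_low, spec_low = sensitivity_specificity(labels, scores, 0.3)
```

In this toy data, lowering the decision threshold from 0.5 to 0.3 raises sensitivity and lowers specificity, while the AUROC is unaffected because it depends only on how the scores are ranked.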

Conclusion: All six selected algorithms could be trained to predict long-term stroke recurrence, and laboratory-based variables were highly associated with stroke recurrence. The latter could be targeted for personalized interventions. Model performance metrics could be optimized, and the models can be implemented in the same healthcare system as intelligent decision support for targeted intervention.

Keywords: artificial intelligence; clinical decision support system; electronic health record; explainable machine learning; healthcare; interpretable machine learning; ischemic stroke; machine learning; outcome prediction; recurrent stroke.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
(A) Flow chart of the inclusion and exclusion of subjects in the case and control groups. Patients in the control group had available records in the electronic health record for at least 5 years and no documented stroke recurrence within 5 years. The distribution panel shows the number of recurrences over time. At 24 days, the number of recurrent cases can be seen to approach a plateau. (B) The design strategy for predicting stroke recurrence using electronic health records (EHR), the Geisinger Quality database, and the Social Security Death database.
Figure 2
Model performance summaries for the five different prediction windows, six different classifiers, and four feature selection approaches. Performance metrics for (A–F) Decision Tree, (G–L) Gradient Boost, (M–R) Logistic Regression, (S–X) Random Forest, (Y–AD) SVM, and (AE–AJ) XGBoost.
Figure 3
Area under the receiver operating characteristic curve (AUROC) for six classifiers at the 1-year prediction window. Feature Set 3 is used for this figure. (A) Model without sampling; (B) Model with up-sampling at a 1:2 ratio; (C) Model with up-sampling at a 1:1 ratio. The best-performing model (AUROC of 0.79) uses up-sampling with the Random Forest algorithm (panel B).
Figure 4
Feature importance based on the different trained models. (A–E) Six different classifiers (Gradient Boost, Random Forest, Extreme Gradient Boosting (XGBoost), Decision Trees, Support Vector Machine (SVM), and Logistic Regression) and five different prediction windows were used. (F) Average feature importance score across the different models and prediction windows.
Figure 5
Model performance summaries with sampling-based optimization for the 1- and 3-year prediction windows. Up-sampling was performed using the Synthetic Minority Over-sampling Technique (SMOTE). Feature Set 3 is used for this figure. (A–F) Model without sampling; (G–L) Model with down-sampling; (M–R) Model with up-sampling.
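
The up-sampling named in the Figure 5 caption can be illustrated with a simplified, standard-library sketch of the SMOTE idea: each synthetic minority row is placed on the line segment between a minority row and one of its nearest minority neighbours. The function names, the toy rows, the value of `k`, and the reading of the 1:2 and 1:1 ratios as cases:controls are all assumptions made for illustration; the authors' actual implementation and parameters are not given on this page.

```python
import random


def smote_upsample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic rows, each interpolated between a randomly
    chosen minority row and one of its k nearest minority neighbours
    (squared Euclidean distance). Simplified sketch, not a library API."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by distance to the chosen base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def upsample_to_ratio(minority, majority, ratio=1.0, k=3, seed=0):
    """Grow the minority class until minority:majority is about `ratio`
    (assumed reading: 0.5 for 1:2, 1.0 for 1:1). Original rows come first."""
    target = int(round(ratio * len(majority)))
    n_new = max(0, target - len(minority))
    return list(minority) + smote_upsample(minority, n_new, k=k, seed=seed)


# Toy rows (hypothetical two-feature records, not study data).
cases = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
controls = [[5.0, 5.0], [5.5, 4.5], [6.0, 5.2], [4.8, 5.1],
            [5.2, 4.9], [5.9, 5.5], [6.1, 4.7], [5.4, 5.3]]

balanced_cases = upsample_to_ratio(cases, controls, ratio=1.0)  # 1:1
half_cases = upsample_to_ratio(cases, controls, ratio=0.5)      # 1:2
```

Synthetic rows should be generated from the training split only, so that no interpolated copy of a test patient influences the fitted model.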
