Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Observational Study
. 2019 Jul;28(7):2045-2051.
doi: 10.1016/j.jstrokecerebrovasdis.2019.02.004. Epub 2019 May 15.

Automating Ischemic Stroke Subtype Classification Using Machine Learning and Natural Language Processing

Affiliations
Observational Study

Automating Ischemic Stroke Subtype Classification Using Machine Learning and Natural Language Processing

Ravi Garg et al. J Stroke Cerebrovasc Dis. 2019 Jul.

Abstract

Objective: The manual adjudication of disease classification is time-consuming, error-prone, and limits scaling to large datasets. In ischemic stroke (IS), subtype classification is critical for management and outcome prediction. This study sought to use natural language processing of electronic health records (EHR) combined with machine learning methods to automate IS subtyping.

Methods: Among IS patients from an observational registry with TOAST subtyping adjudicated by board-certified vascular neurologists, we analyzed unstructured text-based EHR data including neurology progress notes and neuroradiology reports using natural language processing. We performed several feature selection methods to reduce the high dimensionality of the features and 5-fold cross validation to test generalizability of our methods and minimize overfitting. We used several machine learning methods and calculated the kappa values for agreement between each machine learning approach to manual adjudication. We then performed a blinded testing of the best algorithm against a held-out subset of 50 cases.

Results: Compared to manual classification, the best machine-based classification achieved a kappa of .25 using radiology reports alone, .57 using progress notes alone, and .57 using combined data. Kappa values varied by subtype being highest for cardioembolic (.64) and lowest for cryptogenic cases (.47). In the held-out test subset, machine-based classification agreed with rater classification in 40 of 50 cases (kappa .72).

Conclusions: Automated machine learning approaches using textual data from the EHR shows agreement with manual TOAST classification. The automated pipeline, if externally validated, could enable large-scale stroke epidemiology research.

Keywords: Ischemic stroke; cardioembolism; cryptogenic; machine learning; natural language processing.

PubMed Disclaimer

Publication types

MeSH terms

LinkOut - more resources