A GPT-4o-powered framework for identifying cognitive impairment stages in electronic health records

Yu Leng^#¹, Yingnan He^#¹, Samad Amini², Colin Magdamo^{1,3}, Ioannis Paschalidis², Shibani S Mukerji^{1,3}, Lidia M V R Moura^{1,3}, M Brandon Westover^{3,4}, Ana-Maria Vranceanu^{3,5}, Christine S Ritchie^{3,6}, Deborah Blacker^{3,5,7}, John R Dickson^{1,3}, Sudeshna Das^{8,9}

Affiliations

¹ Department of Neurology, Massachusetts General Hospital, Boston, MA, USA.
² Boston University, Boston, MA, USA.
³ Harvard Medical School, Boston, MA, USA.
⁴ Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA, USA.
⁵ Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
⁶ Mongan Institute Center for Aging and Serious Illness and the Division of Palliative Care and Geriatric Medicine, Massachusetts General Hospital, Boston, MA, USA.
⁷ Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
⁸ Department of Neurology, Massachusetts General Hospital, Boston, MA, USA. sdas5@mgh.harvard.edu.
⁹ Harvard Medical School, Boston, MA, USA. sdas5@mgh.harvard.edu.

^# Contributed equally.

PMID: 40610683
PMCID: PMC12229571
DOI: 10.1038/s41746-025-01834-5

Yu Leng et al. NPJ Digit Med. 2025 Jul 3;8(1):401. doi: 10.1038/s41746-025-01834-5.

Abstract

Alzheimer's Disease and Related Dementias (ADRD) pose a major public health challenge, with a critical need for accurate and scalable tools for detecting cognitive impairment (CI). Readily available electronic health records (EHRs) contain valuable cognitive health data, but much of it is embedded in unstructured clinical notes. To address this problem, we developed a GPT-4o-powered framework for CI stage classification, leveraging longitudinal patient history summarization, multi-step reasoning, and confidence-aware decision-making. Evaluated on 165,926 notes from 1002 Medicare patients from Mass General Brigham (MGB), our GPT-4o framework achieved high accuracy in CI stage classification (weighted Cohen's kappa = 0.95, Spearman correlation = 0.93), and outperformed two other language models (weighted Cohen's kappa 0.82-0.85). Our framework also achieved high performance on Clinical Dementia Rating (CDR) scoring on an independent dataset of 769 memory clinic patients (weighted Cohen's kappa = 0.83). Finally, to ensure reliability and safety, we designed an interactive AI agent integrating our GPT-4o-powered framework and clinician oversight. This collaborative approach has the potential to facilitate CI diagnoses in real-world clinical settings.
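The agreement metric reported above, weighted Cohen's kappa, can be illustrated with a minimal sketch. The abstract does not state whether linear or quadratic weights were used, so the `power` parameter below (1 for linear, 2 for quadratic) is an assumption, as are the function name and the label ordering CU < MCI < Dementia taken from the figure captions.

```python
import math

# Ordinal stage order assumed from the figure captions (CU < MCI < Dementia).
STAGES = ["CU", "MCI", "Dementia"]


def weighted_kappa(y_true, y_pred, labels=STAGES, power=2):
    """Weighted Cohen's kappa for ordinal labels.

    power=1 gives linear disagreement weights, power=2 quadratic weights.
    The weighting scheme is an illustrative choice, not taken from the paper.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    index = {label: i for i, label in enumerate(labels)}
    k, n = len(labels), len(y_true)

    # Observed joint distribution of (true stage, predicted stage).
    observed = [[0.0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[index[t]][index[p]] += 1.0 / n

    # Marginals define the agreement expected by chance.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]

    if weighted_expected == 0.0:
        return math.nan  # undefined when every rating falls in one category
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; disagreements between distant stages (CU versus Dementia) are penalized more heavily than adjacent ones.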

Conflict of interest statement

Competing interests: Dr. Westover is a co-founder, scientific advisor, consultant to, and has personal equity interest in Beacon Biosignals.

Figures

**Fig. 1. Overview of the Workflow for Cognitive Impairment (CI) Staging Across Four Frameworks.**
a End-to-End **GPT-4o-Powered Framework**: our end-to-end GPT-4o approach, in which multi-note summaries are chunked and summarized further into a “summary of summaries”; the output is a CI stage, a summary, and a confidence level. b Three Other Frameworks for Comparison. **USE Framework**: Keyword-based sentence extraction with Universal Sentence Encoder (USE) embeddings, Recursive Feature Elimination (RFE), and XGBoost for classification. **DementiaBERT Framework**: Keyword-based sentence extraction with DementiaBERT embeddings (fine-tuned on dementia-related clinical language) and XGBoost classification. **Hybrid Framework**: GPT-4o-generated summaries of clinical notes, chunked and embedded using DementiaBERT, with XGBoost for classification. The figure was partially created in BioRender. He, Y. (2025) https://BioRender.com/4i93myx.
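The "summary of summaries" step in panel a can be sketched as a hierarchical reduction. This is only an illustration of the general idea, not the authors' implementation: `summarize` is a hypothetical callable standing in for whatever model call produces a summary, and the names `chunk`, `summary_of_summaries`, and `chunk_size` are assumptions introduced here.

```python
from typing import Callable, List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def summary_of_summaries(
    notes: List[str],
    summarize: Callable[[str], str],  # hypothetical stand-in for an LLM call
    chunk_size: int = 10,             # illustrative value, not from the paper
) -> str:
    """Reduce many notes to one summary by repeated chunk-and-summarize passes."""
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 so each pass shrinks the list")
    if not notes:
        return ""
    texts = list(notes)
    while True:
        # One pass: join each group of texts and summarize the group.
        texts = [summarize("\n\n".join(group)) for group in chunk(texts, chunk_size)]
        if len(texts) == 1:
            return texts[0]
```

Each pass shrinks the list by roughly a factor of `chunk_size`, so a long longitudinal record is reduced in a logarithmic number of passes while every call sees a bounded amount of text.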

**Fig. 2. Performance and Confidence Analysis of GPT-4o-Powered Framework.**
a Framework Performance: Confusion matrix comparing actual versus GPT-4o predicted cognitive impairment (CI) stages: CU, MCI, Dementia. Darker colors indicate higher counts. b Performance Analysis Stratified by Physician Confidence scores: Bar plot of weighted Cohen’s kappa scores stratified by physicians’ confidence levels. Higher confidence scores (3 and 4) correspond to greater alignment with ground truth. c Comparison of Physician and GPT-4o confidence scores: Heatmap comparing confidence levels assigned by physicians versus GPT-4o. Darker colors represent higher case counts. Abbreviations: CU Cognitively Unimpaired, MCI Mild Cognitive Impairment.

**Fig. 3. Comparison of Framework Performance.**
**USE Framework**: Keyword-based sentence extraction with Universal Sentence Encoder (USE) embeddings and XGBoost classification. **DementiaBERT Framework**: Keyword-based sentence extraction with DementiaBERT embeddings (fine-tuned on dementia-related clinical language) and XGBoost classification. **Hybrid Framework**: GPT-4o-generated summaries with DementiaBERT embeddings and XGBoost classification. **GPT-4o-Powered Framework**: an end-to-end GPT-4o approach using GPT-4o-generated summaries and GPT-4o classification. a Comparison of Weighted Cohen’s Kappa Scores of the Four Models: Bar plot of weighted Cohen’s kappa scores for the four models across 10 cross-validation folds. Each bar represents the kappa score for a specific model on one fold. b Multi-Metric Evaluation of the Four Models’ Performance: Table summarizing the performance of each model across three evaluation metrics: Cohen’s kappa score, Spearman’s rank correlation, and Baccianella’s adapted MSE. Mean and standard deviation values are computed over the 10 folds. c Box Plot of the Weighted Cohen’s Kappa Scores of the Four Models Stratified by Sex: Comparison of kappa scores across the four models, stratified by sex (male and female), with p-values from statistical tests for differences in performance between the male and female groups.

**Fig. 4. Performance and Statistical Analysis of GPT-4o-Powered Framework in Assigning Global CDR.**
Normalized confusion matrices for three GPT-4o-based approaches in cognitive impairment staging: a GPT-4o with Structured Guidance, b RAG-Enabled GPT-4o, and c GPT-4o with Confidence Level and Domain Counts; each matrix shows the proportion of actual vs. predicted CDR scores within each row, with darker colors indicating higher proportions. Rows are normalized to sum to 1. d Multi-Metric Evaluation of Model Performance for Assigning Global CDR: Table summarizing model performance across multiple evaluation metrics—Cohen’s kappa score, Spearman’s Rank Correlation, and Baccianella’s adapted MSE. e Association Between CDR Domains and GPT-4o Confidence Levels in Assigning Global CDR: Table showing the statistical association between CDR domains (binary variable indicating documentation of domain in the note) and GPT-4o confidence levels (Medium, High) in assigning global CDR. β coefficients indicate the effect size of the association of each domain with standard errors and p-values.
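The row normalization described for panels a to c can be shown with a small sketch. The function name is hypothetical, and the label set assumes the standard global CDR levels (0, 0.5, 1, 2, 3); the caption does not list which levels appear in the dataset.

```python
# Standard global CDR levels, assumed here as the label set.
CDR_LEVELS = [0, 0.5, 1, 2, 3]


def row_normalized_confusion(actual, predicted, labels=CDR_LEVELS):
    """Confusion matrix whose rows (actual scores) each sum to 1.

    Rows for scores that never occur in `actual` are left as all zeros.
    """
    index = {label: i for i, label in enumerate(labels)}
    k = len(labels)

    # Raw counts: rows are actual scores, columns are predicted scores.
    counts = [[0] * k for _ in range(k)]
    for a, p in zip(actual, predicted):
        counts[index[a]][index[p]] += 1

    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized
```

Normalizing by row makes each cell the fraction of patients with a given actual score who received each predicted score, so the diagonal reads as per-class recall regardless of class imbalance.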
