NPJ Digit Med. 2025 Jul 3;8(1):401.
doi: 10.1038/s41746-025-01834-5.

A GPT-4o-powered framework for identifying cognitive impairment stages in electronic health records

Yu Leng et al. NPJ Digit Med.

Abstract

Alzheimer's Disease and Related Dementias (ADRD) pose a major public health challenge, with a critical need for accurate and scalable tools for detecting cognitive impairment (CI). Readily available electronic health records (EHRs) contain valuable cognitive health data, but much of it is embedded in unstructured clinical notes. To address this problem, we developed a GPT-4o-powered framework for CI stage classification, leveraging longitudinal patient history summarization, multi-step reasoning, and confidence-aware decision-making. Evaluated on 165,926 notes from 1002 Medicare patients from Mass General Brigham (MGB), our GPT-4o framework achieved high accuracy in CI stage classification (weighted Cohen's kappa = 0.95, Spearman correlation = 0.93), and outperformed two other language models (weighted Cohen's kappa 0.82-0.85). Our framework also achieved high performance on Clinical Dementia Rating (CDR) scoring on an independent dataset of 769 memory clinic patients (weighted Cohen's kappa = 0.83). Finally, to ensure reliability and safety, we designed an interactive AI agent integrating our GPT-4o-powered framework and clinician oversight. This collaborative approach has the potential to facilitate CI diagnoses in real-world clinical settings.
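The agreement metric reported above, weighted Cohen's kappa, can be illustrated with a minimal sketch. It assumes the CI stages are coded as ordinal integers (for example CU = 0, MCI = 1, Dementia = 2). The abstract does not state whether linear or quadratic weights were used, so both are offered here; the function name and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def weighted_kappa(y_true, y_pred, n_classes, weighting="quadratic"):
    """Weighted Cohen's kappa for ordinal labels coded 0..n_classes-1.

    The weighting scheme is an assumption; the abstract does not specify it.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")

    n = len(y_true)
    power = 2 if weighting == "quadratic" else 1
    observed = Counter(zip(y_true, y_pred))  # joint counts of (true, predicted)
    true_marginal = Counter(y_true)
    pred_marginal = Counter(y_pred)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = abs(i - j) ** power  # penalty grows with ordinal distance
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += (
                weight * true_marginal[i] * pred_marginal[j] / (n * n)
            )

    if expected_disagreement == 0:
        return float("nan")  # undefined when both raters use a single class
    return 1.0 - observed_disagreement / expected_disagreement


# Toy example: one CU case is predicted as MCI, all others agree.
truth = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 2]
print(weighted_kappa(truth, predicted, n_classes=3, weighting="quadratic"))
print(weighted_kappa(truth, predicted, n_classes=3, weighting="linear"))
```

Because the weights grow with the distance between stages, a CU case labeled as Dementia is penalized more heavily than a CU case labeled as MCI, which is why a weighted kappa suits ordinal staging better than the unweighted form.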

Conflict of interest statement

Competing interests: Dr. Westover is a co-founder, scientific advisor, consultant to, and has personal equity interest in Beacon Biosignals.

Figures

Fig. 1. Overview of the Workflow for Cognitive Impairment (CI) Staging Across Four Frameworks.
a End-to-End GPT-4o-Powered Framework: an end-to-end GPT-4o approach in which multi-note summaries are chunked and summarized further into a “summary of summaries,” and which outputs a CI stage, a summary, and a confidence level. b Other Three Frameworks for Comparison. USE Framework: Keyword-based sentence extraction with Universal Sentence Encoder (USE) embeddings, Recursive Feature Elimination (RFE), and XGBoost for classification. DementiaBERT Framework: Keyword-based sentence extraction with DementiaBERT embeddings (fine-tuned on dementia-related clinical language) and XGBoost classification. Hybrid Framework: GPT-4o-generated summaries of clinical notes, chunked and embedded using DementiaBERT, with XGBoost for classification. The figure was partially created in BioRender. He, Y. (2025) https://BioRender.com/4i93myx.
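The “summary of summaries” step in panel a can be sketched as a hierarchical reduction. This is only an illustration under stated assumptions: `summarize` is a hypothetical callable standing in for a language model call, and the chunk size, the separator, and the stopping rule are invented here, since the caption does not give the authors' prompts or parameters.

```python
from typing import Callable, List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def summary_of_summaries(
    notes: Sequence[str],
    summarize: Callable[[str], str],  # hypothetical stand-in for an LLM call
    chunk_size: int = 5,              # assumed value, not from the paper
) -> str:
    """Summarize groups of notes, then summarize the summaries until one remains."""
    if not notes:
        raise ValueError("at least one note is required")
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 so each pass shrinks the list")

    texts = list(notes)
    while True:
        # One pass: each group of texts is condensed into a single summary.
        texts = [summarize("\n\n".join(group)) for group in chunk(texts, chunk_size)]
        if len(texts) == 1:
            return texts[0]


# Usage with a toy summarizer that keeps the first 40 characters of its input.
toy_notes = [f"Visit {i}: patient reports memory complaints." for i in range(7)]
print(summary_of_summaries(toy_notes, lambda text: text[:40], chunk_size=3))
```

The first pass produces multi-note summaries, and later passes condense those summaries further, which keeps every individual model call within a bounded input length regardless of how many notes a patient has.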
Fig. 2. Performance and Confidence Analysis of GPT-4o-Powered Framework.
a Framework Performance: Confusion matrix comparing actual versus GPT-4o predicted cognitive impairment (CI) stages: CU, MCI, Dementia. Darker colors indicate higher counts. b Performance Analysis Stratified by Physician Confidence scores: Bar plot of weighted Cohen’s kappa scores stratified by physicians’ confidence levels. Higher confidence scores (3 and 4) correspond to greater alignment with ground truth. c Comparison of Physician and GPT-4o confidence scores: Heatmap comparing confidence levels assigned by physicians versus GPT-4o. Darker colors represent higher case counts. Abbreviations: CU Cognitively Unimpaired, MCI Mild Cognitive Impairment.
Fig. 3. Comparison of Framework Performance.
USE Framework: Keyword-based sentence extraction with Universal Sentence Encoder (USE) embeddings and XGBoost classification. DementiaBERT Framework: Keyword-based sentence extraction with DementiaBERT embeddings (fine-tuned on dementia-related clinical language) and XGBoost classification. Hybrid Framework: GPT-4o-generated summaries with DementiaBERT embeddings and XGBoost classification. GPT-4o-Powered Framework: an end-to-end GPT-4o approach using GPT-4o-generated summaries and GPT-4o classification. a Comparison of Weighted Cohen’s Kappa Scores of the Four Models: Bar plot of weighted Cohen’s kappa scores for four models across 10 cross-validation folds. Each bar represents the kappa score for a specific model on each fold. b Multi-Metric Evaluation of the Four Models’ Performance: Table summarizing the performance of each model across three evaluation metrics: Cohen’s kappa score, Spearman’s Rank Correlation, and Baccianella’s adapted MSE. Mean and standard deviation values are provided over 10 folds. c Box Plot of the Weighted Cohen’s Kappa Scores of the Four Models Stratified by Sex: Comparison of kappa scores across the four models, stratified by sex (Male and Female), with p-values indicating statistical tests for differences in performance between male and female groups.
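The caption does not define “Baccianella’s adapted MSE”. One plausible reading, offered here purely as an assumption, is a macro-averaged mean squared error in the spirit of the macro-averaged error measures that Baccianella and colleagues proposed for ordinal classification: the squared error is averaged within each true class first, then across classes, so that rare stages count as much as common ones. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def macro_averaged_mse(y_true, y_pred):
    """Mean of per-class mean squared errors, grouped by the true label.

    Assumed interpretation of the "adapted MSE" in the caption, not a
    confirmed definition from the paper.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    squared_errors = defaultdict(list)
    for truth, prediction in zip(y_true, y_pred):
        squared_errors[truth].append((truth - prediction) ** 2)

    per_class_mse = [sum(errs) / len(errs) for errs in squared_errors.values()]
    return sum(per_class_mse) / len(per_class_mse)


# Imbalanced toy data: four stage-0 cases are correct, the single stage-2 case is missed.
truth = [0, 0, 0, 0, 2]
predicted = [0, 0, 0, 0, 0]
print(macro_averaged_mse(truth, predicted))  # class 0 contributes 0, class 2 contributes 4
```

On this toy data the plain MSE would be 4/5 = 0.8, while the macro-averaged version gives (0 + 4)/2 = 2.0, which shows how class-wise averaging keeps errors on a minority stage from being diluted.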
Fig. 4. Performance and Statistical Analysis of GPT-4o-Powered Framework in Assigning Global CDR.
Normalized confusion matrices for three GPT-4o-based approaches in cognitive impairment staging: a GPT-4o with Structured Guidance, b RAG-Enabled GPT-4o, and c GPT-4o with Confidence Level and Domain Counts; each matrix shows the proportion of actual vs. predicted CDR scores within each row, with darker colors indicating higher proportions. Rows are normalized to sum to 1. d Multi-Metric Evaluation of Model Performance for Assigning Global CDR: Table summarizing model performance across multiple evaluation metrics—Cohen’s kappa score, Spearman’s Rank Correlation, and Baccianella’s adapted MSE. e Association Between CDR Domains and GPT-4o Confidence Levels in Assigning Global CDR: Table showing the statistical association between CDR domains (binary variable indicating documentation of domain in the note) and GPT-4o confidence levels (Medium, High) in assigning global CDR. β coefficients indicate the effect size of the association of each domain with standard errors and p-values.
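The row normalization described for panels a to c can be sketched as follows. The label set uses the standard global CDR values (0, 0.5, 1, 2, 3); the function name, the toy data, and the choice to return zeros for a true class with no cases are assumptions made for illustration.

```python
def row_normalized_confusion(y_true, y_pred, labels):
    """Confusion matrix in which each row (true label) sums to 1.

    Rows for labels that never occur in y_true are left as all zeros
    (an assumed convention; the caption does not cover this case).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have equal length")

    index = {label: position for position, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        counts[index[truth]][index[prediction]] += 1

    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


cdr_labels = [0.0, 0.5, 1.0, 2.0, 3.0]
truth = [0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 1.0]
predicted = [0.0, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
for label, row in zip(cdr_labels, row_normalized_confusion(truth, predicted, cdr_labels)):
    print(label, row)
```

Each entry then reads as the share of cases with a given actual CDR score that received a given predicted score, so the diagonal of each row is the per-class recall.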

