Supervised machine learning compared to large language models for identifying functional seizures from medical records

Wesley T Kerr¹ ² ³ ⁴, Katherine N McFarlane¹, Gabriela Figueiredo Pucci¹, Danielle R Carns¹, Alex Israel¹, Lianne Vighetti⁵, Page B Pennell¹, John M Stern², Zongqi Xia¹, Yanshan Wang⁴ ⁶ ⁷

Affiliations

¹ Department of Neurology, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
² Department of Neurology, University of California, Los Angeles, Los Angeles, California, USA.
³ Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, Los Angeles, California, USA.
⁴ Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
⁵ Department of Social Work, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
⁶ Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
⁷ Department of Health Information Management, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.

PMID: 39960122
PMCID: PMC11997926
DOI: 10.1111/epi.18272

Comparative Study

Wesley T Kerr et al. Epilepsia. 2025 Apr.

Epilepsia. 2025 Apr;66(4):1155-1164. doi: 10.1111/epi.18272. Epub 2025 Feb 17.

Abstract

Objective: The Functional Seizures Likelihood Score (FSLS) is a supervised machine learning-based diagnostic score that was developed to differentiate functional seizures (FS) from epileptic seizures (ES). In contrast to this targeted approach, large language models (LLMs) can identify patterns in data for which they were not specifically trained. To evaluate the relative benefits of each approach, we compared the diagnostic performance of the FSLS to two LLMs: ChatGPT and GPT-4.

Methods: In total, 114 anonymized cases were constructed based on patients with documented FS, ES, mixed ES and FS, or physiologic seizure-like events (PSLEs). Text-based data were presented in three sequential prompts to the LLMs, showing the history of present illness (HPI), electroencephalography (EEG) results, and neuroimaging results. We compared the accuracy (number of correct predictions/number of cases) and area under the receiver-operating characteristic (ROC) curves (AUCs) of the LLMs to the FSLS using mixed-effects logistic regression.
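As a minimal sketch of the two metrics named above, the following Python computes accuracy as the number of correct predictions divided by the number of cases, and the AUC as the probability that a randomly chosen FS case receives a higher score than a randomly chosen non-FS case (ties counted as one half). The function names, the binary FS-versus-other framing for the AUC, and the toy labels and scores are assumptions made for illustration; they are not the study's data or code, and the mixed-effects logistic regression used to compare methods is not shown.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Number of correct predictions divided by number of cases."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def auc(is_fs: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    is_fs holds 1 for a functional-seizure case and 0 otherwise;
    scores holds the predicted likelihood of functional seizures.
    """
    pos = [s for label, s in zip(is_fs, scores) if label == 1]
    neg = [s for label, s in zip(is_fs, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one case in each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the study.
toy_truth = ["FS", "FS", "FS", "ES", "ES", "ES"]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
toy_pred = ["FS" if s >= 0.5 else "ES" for s in toy_scores]
toy_is_fs = [1 if t == "FS" else 0 for t in toy_truth]

toy_accuracy = accuracy(toy_truth, toy_pred)
toy_auc = auc(toy_is_fs, toy_scores)
```

Accuracy depends on the threshold used to turn a score into a predicted diagnosis, whereas the AUC summarizes ranking quality across all thresholds, which is one reason to report both.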

Results: The accuracy of FSLS was 74% (95% confidence interval [CI] 65%-82%) and the AUC was 85% (95% CI 77%-92%). GPT-4 was superior to both the FSLS and ChatGPT (p < .001), with an accuracy of 85% (95% CI 77%-91%) and AUC of 87% (95% CI 79%-95%). Cohen's kappa between the FSLS and GPT-4 was 40% (fair). The LLMs provided different predictions on different days when the same note was provided for 33% of patients, and the LLMs' self-rated certainty was moderately correlated with this observed variability (Spearman's rho²: 30% [fair, ChatGPT] and 63% [substantial, GPT-4]).
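For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares the observed agreement between two raters, p_o, with the agreement expected by chance from their marginal label frequencies, p_e, as kappa = (p_o - p_e) / (1 - p_e). The sketch below is an illustrative implementation under that standard definition; the function name and the toy predictions are assumptions and do not reproduce the study's data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two sets of categorical predictions on the same cases."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal frequencies per label.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical predictions from two methods on eight cases.
toy_score_pred = ["FS", "FS", "ES", "ES", "FS", "ES", "ES", "FS"]
toy_llm_pred = ["FS", "ES", "ES", "ES", "FS", "ES", "FS", "FS"]
toy_kappa = cohens_kappa(toy_score_pred, toy_llm_pred)
```

In this toy example the two methods agree on 6 of 8 cases (75%), but half of that agreement is expected by chance, so kappa is 0.5. This illustrates how two methods can have similar accuracy and still show only fair agreement with each other.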

Significance: Both GPT-4 and the FSLS identified a substantial subset of patients with FS based on clinical history. The fair agreement in predictions highlights that the LLMs identified patients differently from the structured score. The inconsistency of the LLMs' predictions across days and incomplete insight into their own consistency was concerning. This comparison highlights both benefits and cautions about how machine learning and artificial intelligence could identify patients with FS in clinical practice.

Keywords: electronic health record; informatics; physiologic seizure‐like events; psychogenic nonepileptic seizures (PNES); sensitivity.

Figures

**FIGURE 1**
Predictions of the FSLS, ChatGPT, and GPT‐4 in patients with each type of video‐EEG‐based diagnosis. The numbers within the bars reflect the portion of that patient group with each predicted diagnosis. The FSLS never predicted mixed ES + FS (blue) or PSLEs (green), whereas ChatGPT and GPT‐4 commonly predicted mixed ES + FS (blue). (See Tables S2 and S3 for detailed performance statistics.) EEG, electroencephalography; ES, epileptic seizures; FS, functional seizures; FSLS, Functional Seizures Likelihood Score; GPT, Generative Pre‐trained Transformer; PSLE, physiologic seizure‐like event.

**FIGURE 2**
The predictions of the FSLS were correlated with but had only fair to moderate agreement with the predictions of ChatGPT (A, Cohen's kappa 26%) and GPT‐4 (B, Cohen's kappa 42%). Each dot reflects a patient, and colors reflect the ictal video‐EEG monitoring–based gold standard diagnosis. Correct predictions of functional seizures (FS) would be in the top right quadrant. Correct predictions of epileptic seizures (ES) would be in the bottom left. Disagreements between methods are in the top left and bottom right. ChatGPT and GPT‐4 often predicted epilepsy only, so all patients stacked on the left axis were predicted to have epilepsy. EEG, electroencephalography; ES, epileptic seizures; FS, functional seizures; FSLS, Functional Seizures Likelihood Score; PSLE, physiologic seizure‐like events.

**FIGURE 3**
The LLMs provided different answers to the same patient on different days, and they had only poor insight into this uncertainty (Spearman's rho²: 30% [ChatGPT, A] and 63% [GPT‐4, B]). Similar to Figure 2, each dot reflects a patient, and colors reflect the ictal video‐EEG monitoring–based gold standard diagnosis. Perfect insight into this uncertainty would be along the diagonal line, whereas the distance from the diagonal line reflects differences between self‐reported certainty and observed certainty. Due to the high number of patients with high predicted probability of epilepsy, dots off the axis reflect observed predicted probability of epilepsy of 0%. EEG, electroencephalography; LLM, large language model.
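One way to quantify the comparison in this caption is sketched below, under stated assumptions: "observed certainty" is taken here to be the share of repeated runs that returned the most common prediction for a patient, and it is compared with the model's self-reported certainty using Spearman's rank correlation (Pearson correlation of average ranks). This definition, the function names, and the toy values are hypothetical illustrations, not the authors' procedure or data.

```python
from collections import Counter
from typing import List, Sequence


def observed_consistency(runs: Sequence[str]) -> float:
    """Share of repeated runs that returned the most common prediction."""
    counts = Counter(runs)
    return counts.most_common(1)[0][1] / len(runs)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical repeated predictions for four patients across four days.
toy_runs = [
    ["ES", "ES", "ES", "ES"],
    ["FS", "FS", "ES", "FS"],
    ["FS", "ES", "FS", "ES"],
    ["ES", "ES", "ES", "FS"],
]
toy_self_rated = [0.9, 0.8, 0.6, 0.7]  # hypothetical self-reported certainty

toy_observed = [observed_consistency(r) for r in toy_runs]
toy_share_inconsistent = sum(len(set(r)) > 1 for r in toy_runs) / len(toy_runs)
toy_rho = spearman_rho(toy_self_rated, toy_observed)
toy_rho_squared = toy_rho ** 2
```

Squaring rho gives the rho² statistic quoted in the caption; values well below 1 indicate that self-reported certainty only partly tracks how consistently the model actually answers.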

**FIGURE 4**
ROC curves for the FSLS, ChatGPT, and GPT‐4. The non‐overlapping nature of these curves suggests that the algorithms made their predictions differently, rather than merely operating at different sensitivity thresholds. ES, epileptic seizures; FS, functional seizures; FSLS, Functional Seizures Likelihood Score; ROC, receiver‐operating characteristic.
