. 2025 Nov 25:15910199251396358.
doi: 10.1177/15910199251396358. Online ahead of print.

Large language model responses to patient-oriented neurointerventional queries: A multirater assessment of accuracy, completeness, safety, and actionability


Albert Hw Jiang et al. Interv Neuroradiol. .

Abstract

Background
As large language models (LLMs) become increasingly accessible to the public, patients are turning to these tools for medical guidance, including in highly specialized fields such as interventional neuroradiology. Despite their growing use, the safety, completeness, and reliability of LLM-generated information in subspecialty medicine remain unclear.

Methods
Five publicly available LLMs (ChatGPT, Gemini, Claude, Perplexity, and DeepSeek) were prompted with four neurointerventional patient-facing clinical questions spanning ischemic stroke, hemorrhagic stroke, venous disorders, and procedural device use. Each model was queried three times per question to generate unique responses. Eight blinded raters scored each response on accuracy, completeness, safety, and actionability using Likert scales. Plagiarism analyses were also performed.

Results
DeepSeek consistently outperformed the other LLMs in accuracy, completeness, and actionability across the four prompts, while Gemini frequently ranked worst, including in plagiarism levels. ChatGPT performed well in accuracy. Physicians were more critical than non-physicians across accuracy, completeness, and safety, whereas non-physicians rated actionability significantly lower. Overall, LLMs were rated relatively high (median >4 on a 5-point scale) in medical safety, suggesting a low risk of overtly harmful advice.

Conclusion
Recent-generation LLMs offer medically safe, though often incomplete or imprecise, information in response to patient-oriented neurointerventional queries. Including non-physician raters revealed valuable differences in perception that are relevant to how patients may interpret LLM outputs. As benchmark frameworks such as HealthBench improve LLM evaluation, inclusion of lay perspectives and subspecialty contexts remains essential. Responsible use by clinicians and ongoing patient education will be critical as LLM use in healthcare expands.

Keywords: Aneurysm; device; stroke; technology.
