. 2025 Nov 25:15910199251396358.
doi: 10.1177/15910199251396358. Online ahead of print.

Large language model responses to patient-oriented neurointerventional queries: A multirater assessment of accuracy, completeness, safety, and actionability


Albert Hw Jiang et al. Interv Neuroradiol. .

Abstract

Background
As large language models (LLMs) become increasingly accessible to the public, patients are turning to these tools for medical guidance, including in highly specialized fields such as interventional neuroradiology. Despite their growing use, the safety, completeness, and reliability of LLM-generated information in subspecialty medicine remain unclear.

Methods
Five publicly available LLMs (ChatGPT, Gemini, Claude, Perplexity, and DeepSeek) were prompted with four neurointerventional patient-facing clinical questions spanning ischemic stroke, hemorrhagic stroke, venous disorders, and procedural device use. Each model was queried three times per question to generate unique responses. Eight blinded raters scored each response on accuracy, completeness, safety, and actionability using Likert scales. Plagiarism analyses were also performed.

Results
DeepSeek consistently outperformed the other LLMs in accuracy, completeness, and actionability across the four prompts, while Gemini frequently ranked worst, including in plagiarism levels. ChatGPT performed well in accuracy. Physicians were more critical than non-physicians across accuracy, completeness, and safety, whereas non-physicians rated actionability significantly lower. Overall, LLMs were rated relatively high (median >4 on a 5-point scale) in medical safety, suggesting a low risk of overtly harmful advice.

Conclusion
Recent-generation LLMs offer medically safe, though often incomplete or imprecise, information in response to patient-oriented neurointerventional queries. Including non-physician raters revealed valuable differences in perception that are relevant to how patients may interpret LLM outputs. As benchmark frameworks such as HealthBench improve LLM evaluation, inclusion of lay perspectives and subspecialty contexts remains essential. Responsible use by clinicians and ongoing patient education will be critical as LLM use in healthcare expands.

Keywords: Aneurysm; device; stroke; technology.
