Large language model responses to patient-oriented neurointerventional queries: A multirater assessment of accuracy, completeness, safety, and actionability
- PMID: 41289178
- DOI: 10.1177/15910199251396358
Abstract
Background: As large language models (LLMs) become increasingly accessible to the public, patients are turning to these tools for medical guidance, including in highly specialized fields such as interventional neuroradiology. Despite their growing use, the safety, completeness, and reliability of LLM-generated information in subspecialty medicine remain unclear.
Methods: Five publicly available LLMs (ChatGPT, Gemini, Claude, Perplexity, and DeepSeek) were prompted with four neurointerventional patient-facing clinical questions spanning ischemic stroke, hemorrhagic stroke, venous disorders, and procedural device use. Each model was queried three times per question to generate unique responses. Eight blinded raters scored each response on accuracy, completeness, safety, and actionability using Likert scales. Plagiarism analyses were also performed.
Results: DeepSeek consistently outperformed the other LLMs in accuracy, completeness, and actionability across all four prompts, while Gemini frequently ranked worst, including in plagiarism levels. ChatGPT performed well in accuracy. Physicians were more critical than non-physicians across accuracy, completeness, and safety, whereas non-physicians rated actionability significantly lower. Overall, LLMs were rated relatively high in medical safety (median >4 on a 5-point scale), suggesting a low risk of overtly harmful advice.
Conclusion: Recent-generation LLMs offer medically safe, though often incomplete or imprecise, information in response to patient-oriented neurointerventional queries. Including non-physician raters revealed valuable differences in perception that are relevant to how patients may interpret LLM outputs. As benchmark frameworks like HealthBench improve LLM evaluation, inclusion of lay perspectives and subspecialty contexts remains essential. Responsible use by clinicians and ongoing patient education will be critical as LLM use in healthcare expands.
Keywords: Aneurysm; device; stroke; technology.
