Neurospine. 2024 Jun;21(2):633-641.
doi: 10.14245/ns.2448098.049. Epub 2024 Jun 30.

Analyzing Large Language Models' Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard

Siegmund Philipp Lang et al. Neurospine. 2024 Jun.

Abstract

Objective: In the digital age, patients turn to online sources for lumbar spine fusion information, necessitating a careful study of large language models (LLMs) like chat generative pre-trained transformer (ChatGPT) for patient education.

Methods: This study assessed the response quality of OpenAI's ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones via Google search, which were then presented to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated using a 5-point Likert scale.

Results: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: k = 0.041, p = 0.622; Bard: k = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.
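The abstract does not state which agreement statistic the reported k values come from. As a minimal sketch, assuming a Fleiss-type kappa for a fixed panel of raters assigning categorical ratings, the calculation could look like the following. The function name `fleiss_kappa` and the ratings table are hypothetical and are not the study's data.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance from the category shares.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: 10 questions, 5 raters, 4 rating categories
# (columns: excellent, minimal, moderate, substantial clarification).
hypothetical_counts = [
    [4, 1, 0, 0],
    [3, 2, 0, 0],
    [2, 2, 1, 0],
    [3, 1, 1, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 0],
    [3, 2, 0, 0],
    [4, 1, 0, 0],
    [3, 2, 0, 0],
    [4, 1, 0, 0],
]
print(f"kappa = {fleiss_kappa(hypothetical_counts):.3f}")
```

When most ratings fall into one or two categories, chance-expected agreement is already high, so a kappa near zero can coexist with a large share of favorable ratings.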

Conclusion: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.

Keywords: Artificial intelligence; Bard; ChatGPT; Large language models; Lumbar spine fusion; Patient education.

Conflict of interest statement

The authors have nothing to disclose.

Figures

Fig. 1.
Selection process for identifying top 10 FAQs on lumbar spine fusion surgery. This flowchart illustrates the methodology from initial search to final question curation. ChatGPT, chat generative pre-trained transformer; FAQ, frequently asked questions.
Fig. 2.
Pie chart with the distribution of overall ratings, expressed in percentages, for the combined question set across the 2 large language models.
Fig. 3.
Distribution of rater evaluations for ChatGPT and Bard across 10 lumbar surgery questions. The bars represent the percentage of raters assigning each category. ChatGPT, chat generative pre-trained transformer.
Fig. 4.
Median ratings comparison between ChatGPT 3.5 and Bard. Bars show median scores; error bars show range (minimum–maximum scoring). No significant differences in ratings across all questions. ChatGPT, chat generative pre-trained transformer.
