The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease
- PMID: 38630178
- DOI: 10.1007/s00464-024-10807-w
Abstract
Background: Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD).
Methods: Nine patient cases were created based on key questions (KQs) addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16th, 2023, for recommendations regarding the surgical management of GERD. Accurate chatbot performance was defined as the number of responses aligning with SAGES guideline recommendations. Outcomes were reported with counts and percentages.
Results: For the surgical management of GERD in an adult patient, surgeons were given recommendations that were accurate per the SAGES guidelines for 5/7 (71.4%) KQs by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity. Patients were given accurate recommendations for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity. For a pediatric patient, surgeons were given accurate recommendations for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity. Patients were given appropriate guidance for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity.
Conclusions: Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and pitfalls of LLMs when utilized for advice on the surgical management of GERD. Additional training of LLMs using evidence-based health information is needed.
Keywords: ChatGPT; GERD; Generative artificial intelligence; Guidelines; Large language models; Natural language processing; Surgery.
© 2024. The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
Similar articles
- Clinical artificial intelligence: teaching a large language model to generate recommendations that align with guidelines for the surgical management of GERD. Surg Endosc. 2024 Oct;38(10):5668-5677. doi: 10.1007/s00464-024-11155-5. Epub 2024 Aug 12. PMID: 39134725
- Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care. Medicine (Baltimore). 2024 Aug 16;103(33):e39305. doi: 10.1097/MD.0000000000039305. PMID: 39151545 Free PMC article.
- Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists' Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study. JMIR Form Res. 2025 Feb 5;9:e56126. doi: 10.2196/56126. PMID: 39794312 Free PMC article.
- ChatGPT and large language model (LLM) chatbots: The current state of acceptability and a proposal for guidelines on utilization in academic medicine. J Pediatr Urol. 2023 Oct;19(5):598-604. doi: 10.1016/j.jpurol.2023.05.018. Epub 2023 Jun 2. PMID: 37328321 Review.
- Exploring the role of artificial intelligence, large language models: Comparing patient-focused information and clinical decision support capabilities to the gynecologic oncology guidelines. Int J Gynaecol Obstet. 2025 Feb;168(2):419-427. doi: 10.1002/ijgo.15869. Epub 2024 Aug 20. PMID: 39161265 Free PMC article. Review.
Cited by
- A Performance Evaluation of Large Language Models in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity. J Clin Med. 2024 Oct 30;13(21):6512. doi: 10.3390/jcm13216512. PMID: 39518652 Free PMC article.
- Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open. 2025 Feb 3;8(2):e2457879. doi: 10.1001/jamanetworkopen.2024.57879. PMID: 39903463 Free PMC article.
- Leveraging ChatGPT to strengthen pediatric healthcare systems: a systematic review. Eur J Pediatr. 2025 Jul 12;184(8):478. doi: 10.1007/s00431-025-06320-4. PMID: 40650728 Review.
- Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study. Front Digit Health. 2025 Jun 27;7:1574287. doi: 10.3389/fdgth.2025.1574287. eCollection 2025. PMID: 40657647 Free PMC article.
- Assessing the Accuracy of Artificial Intelligence Models in Scoliosis Classification and Suggested Therapeutic Approaches. J Clin Med. 2024 Jul 9;13(14):4013. doi: 10.3390/jcm13144013. PMID: 39064053 Free PMC article.