Chatbot Reliability in Managing Thoracic Surgical Clinical Scenarios
- PMID: 38574939
- DOI: 10.1016/j.athoracsur.2024.03.023
Abstract
Background: Chatbot use in medicine is growing, and concerns have been raised regarding their accuracy. This study assessed the performance of 4 different chatbots in managing thoracic surgical clinical scenarios.
Methods: Topic domains were identified and clinical scenarios were developed within each domain. Each scenario included 3 stems using Key Feature methods related to diagnosis, evaluation, and treatment. Twelve scenarios were presented to ChatGPT-4 (OpenAI), Bard (recently renamed Gemini; Google), Perplexity (Perplexity AI), and Claude 2 (Anthropic) in 3 separate runs. Up to 1 point was awarded for each stem, yielding a potential of 3 points per scenario. Critical failures were identified before scoring; if they occurred, the stem and overall scenario scores were adjusted to 0. We arbitrarily established a threshold of ≥2 points mean adjusted score per scenario as a passing grade and established a critical fail rate of ≥30% as failure to pass.
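The scoring rule described in the Methods can be sketched in a few lines of Python. This is an illustrative reading of the abstract, not the authors' scoring code: the names `ScenarioResult` and `grade_run`, the example scores, and the choice to combine both thresholds into a single pass flag are all assumptions. The sketch also zeroes only the scenario total when a critical failure occurs, since the abstract does not say how individual stems were flagged.

```python
from dataclasses import dataclass
from statistics import mean

# Thresholds as stated in the abstract.
PASS_MEAN_ADJUSTED = 2.0          # mean adjusted score per scenario must be >= 2
CRITICAL_FAIL_RATE_LIMIT = 0.30   # a critical fail rate >= 30% is a failure


@dataclass
class ScenarioResult:
    """One chatbot answer to one scenario (hypothetical container)."""
    stem_scores: tuple        # three values in [0, 1]: diagnosis, evaluation, treatment
    critical_failure: bool    # judged before scoring


def raw_score(result: ScenarioResult) -> float:
    """Sum of the stem scores, up to 3 points per scenario."""
    return sum(result.stem_scores)


def adjusted_score(result: ScenarioResult) -> float:
    """A critical failure resets the scenario score to 0."""
    return 0.0 if result.critical_failure else raw_score(result)


def grade_run(results: list) -> dict:
    """Summarize one run; combining both thresholds is an assumption."""
    mean_raw = mean(raw_score(r) for r in results)
    mean_adjusted = mean(adjusted_score(r) for r in results)
    critical_rate = sum(r.critical_failure for r in results) / len(results)
    passed = (mean_adjusted >= PASS_MEAN_ADJUSTED
              and critical_rate < CRITICAL_FAIL_RATE_LIMIT)
    return {
        "mean_raw": mean_raw,
        "mean_adjusted": mean_adjusted,
        "critical_fail_rate": critical_rate,
        "passed": passed,
    }


# Made-up scores for four scenarios, for illustration only.
example_run = [
    ScenarioResult((1.0, 1.0, 1.0), False),
    ScenarioResult((1.0, 0.5, 0.5), False),
    ScenarioResult((1.0, 1.0, 0.0), True),   # critical failure: adjusted to 0
    ScenarioResult((0.5, 0.5, 1.0), False),
]
summary = grade_run(example_run)
```

In this made-up run, a single critical failure pulls the mean adjusted score below the passing threshold even though the mean raw score is above 2, which is the effect the adjustment is meant to capture.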
Results: Bot performance varied considerably within each run, and overall performance was a fail on all runs (critical mean scenario fails of 83%, 71%, and 71% for runs 1, 2, and 3). The bots trended toward "learning" from the first to the second run, but across the 3 runs there was no significant improvement in overall raw (1.24 ± 0.47 vs 1.63 ± 0.76 vs 1.51 ± 0.60; P = .29) or adjusted (0.44 ± 0.54 vs 0.80 ± 0.94 vs 0.76 ± 0.81; P = .48) scenario scores.
Conclusions: Chatbot performance in managing clinical scenarios was insufficient to provide reliable assistance. This is a cautionary note against reliance on the current accuracy of chatbots in complex thoracic surgery medical decision making.
Copyright © 2024 The Society of Thoracic Surgeons. Published by Elsevier Inc. All rights reserved.
Comment in
- "Pseudo" Intelligence or Misguided or Mis-sourced Intelligence? Ann Thorac Surg. 2024 Jul;118(1):281-282. doi: 10.1016/j.athoracsur.2024.04.007. Epub 2024 Apr 24. PMID: 38663658.
- Enhancing Artificial Intelligence Chatbot Reliability in Medical Scenarios. Ann Thorac Surg. 2024 Dec;118(6):1341. doi: 10.1016/j.athoracsur.2024.04.036. Epub 2024 May 24. PMID: 38797226.
