The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: An analysis of ChatGPT 3.5, ChatGPT 4, and Bard
- PMID: 38148925
- PMCID: PMC10749221
- DOI: 10.1016/j.jor.2023.11.063
Abstract
Background: Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools compared to human clinicians in addressing complex orthopaedic decisions in real-world clinical cases.
Questions/purposes: To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians.
Patients and methods: The study used OrthoBullets Cases, a publicly available clinical case collaboration platform on which surgeons from around the world choose treatment options through peer-reviewed, standardised treatment polls. The clinical cases span a range of orthopaedic categories. Three AI tools (ChatGPT 3.5, ChatGPT 4, and Bard) were evaluated. Uniform prompts were used to input case information, including questions relating to each case, and the AI tools' responses were analysed for alignment with the most popular human response, as well as for falling within 10% and within 20% of the most popular human response.
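As a minimal sketch of this alignment scoring, the Python snippet below compares one hypothetical AI answer against a treatment poll. The poll format, the reading of "within 10%/20%" as percentage-point distance from the top option's vote share, and the function name score_alignment are illustrative assumptions; the abstract does not specify the authors' exact scoring procedure.

```python
# Hedged sketch of the alignment scoring described in the methods.
# Assumes each poll maps treatment options to their share of human votes (%).

def score_alignment(poll: dict[str, float], ai_choice: str) -> dict[str, bool]:
    """Score one AI answer against a human treatment poll."""
    top_share = max(poll.values())       # vote share of the most popular option
    ai_share = poll.get(ai_choice, 0.0)  # vote share of the option the AI chose
    return {
        "most_popular": ai_share == top_share,         # matches the modal human answer
        "within_10pct": top_share - ai_share <= 10.0,  # within 10% of the top option
        "within_20pct": top_share - ai_share <= 20.0,  # within 20% of the top option
    }

# Example: humans split 55/40/5 across three options; the AI picks option B.
poll = {"A": 55.0, "B": 40.0, "C": 5.0}
print(score_alignment(poll, "B"))
# {'most_popular': False, 'within_10pct': False, 'within_20pct': True}
```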
Results: In total, 8 clinical categories comprising 97 questions were analysed. ChatGPT 4 demonstrated the highest proportion of most popular responses (ChatGPT 4 68.0%, ChatGPT 3.5 40.2%, Bard 45.4%; P < 0.001), outperforming the other AI tools. AI tools performed worse on questions considered controversial (those on which human responses disagreed). Inter-tool agreement, evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. Bard) to 0.634 (ChatGPT 3.5 vs. Bard). However, AI tool responses varied widely, underscoring the need for consistency in real-world clinical applications.
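The reported inter-tool kappa values can in principle be reproduced from paired per-question answers. Below is a minimal sketch of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement; the answer lists are invented for illustration, since the study's per-question data are not given in the abstract.

```python
# Hedged sketch: Cohen's kappa over two tools' categorical answers.
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each tool's marginal answer frequencies
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical answers from two tools over five questions
chatgpt4 = ["A", "B", "A", "C", "B"]
bard     = ["A", "B", "C", "C", "A"]
print(f"kappa = {cohens_kappa(chatgpt4, bard):.3f}")  # kappa = 0.412
```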
Conclusions: While AI tools demonstrated potential utility in educational contexts, their integration into clinical decision-making warrants caution owing to inconsistent responses and deviations from peer consensus. Future research should focus on developing specialised clinical AI tools to maximise their utility in clinical decision-making.
Level of evidence: IV.
© 2023 The Authors.
Conflict of interest statement
All authors declare they have no conflicts of interest relating to this study.
Similar articles
- Evaluation of the Current Status of Artificial Intelligence for Endourology Patient Education: A Blind Comparison of ChatGPT and Google Bard Against Traditional Information Resources. J Endourol. 2024 Aug;38(8):843-851. doi: 10.1089/end.2023.0696. PMID: 38441078
- Chat Generative Pretrained Transformer (ChatGPT) and Bard: Artificial Intelligence Does not yet Provide Clinically Supported Answers for Hip and Knee Osteoarthritis. J Arthroplasty. 2024 May;39(5):1184-1190. doi: 10.1016/j.arth.2024.01.029. PMID: 38237878
- Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions. Adv Med Educ Pract. 2024 Sep 20;15:857-871. doi: 10.2147/AMEP.S479801. PMID: 39319062
- Can Artificial Intelligence Fool Residency Selection Committees? Analysis of Personal Statements by Real Applicants and Generative AI, a Randomized, Single-Blind Multicenter Study. JB JS Open Access. 2024 Oct 24;9(4):e24.00028. doi: 10.2106/JBJS.OA.24.00028. PMID: 39450246
- ChatGPT Performs at the Level of a Third-Year Orthopaedic Surgery Resident on the Orthopaedic In-Training Examination. JB JS Open Access. 2023 Dec 11;8(4):e23.00103. doi: 10.2106/JBJS.OA.23.00103. PMID: 38638869
Cited by
- Assessing the Current Limitations of Large Language Models in Advancing Health Care Education. JMIR Form Res. 2025 Jan 16;9:e51319. doi: 10.2196/51319. PMID: 39819585
- Evaluating DeepResearch and DeepThink in anterior cruciate ligament surgery patient education: ChatGPT-4o excels in comprehensiveness, DeepSeek R1 leads in clarity and readability of orthopaedic information. Knee Surg Sports Traumatol Arthrosc. 2025 Aug;33(8):3025-3031. doi: 10.1002/ksa.12711. PMID: 40450565
- Evaluating the quality and readability of ChatGPT-generated patient-facing medical information in rhinology. Eur Arch Otorhinolaryngol. 2025 Apr;282(4):1911-1920. doi: 10.1007/s00405-024-09180-0. PMID: 39724239
- Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems. Diagnostics (Basel). 2024 Jul 11;14(14):1491. doi: 10.3390/diagnostics14141491. PMID: 39061628
- Examining the Role of Large Language Models in Orthopedics: Systematic Review. J Med Internet Res. 2024 Nov 15;26:e59607. doi: 10.2196/59607. PMID: 39546795