The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: An analysis of ChatGPT 3.5, ChatGPT 4, and Bard
- PMID: 38148925
- PMCID: PMC10749221
- DOI: 10.1016/j.jor.2023.11.063
Abstract
Background: Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools compared to human clinicians in addressing complex orthopaedic decisions in real-world clinical cases.
Questions/purposes: To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians.
Patients and methods: The study used OrthoBullets Cases, a publicly available clinical case collaboration platform on which surgeons from around the world choose treatment options through peer-reviewed, standardised treatment polls. The clinical cases span a range of orthopaedic categories. Three AI tools (ChatGPT 3.5, ChatGPT 4, and Bard) were evaluated. Uniform prompts were used to input case information, including questions relating to each case, and the AI tools' responses were analysed for alignment with the most popular human response, as well as for falling within 10% and within 20% of the most popular human response.
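As a minimal sketch of this alignment scoring, the Python snippet below compares one hypothetical AI answer against a treatment poll. The poll format, the reading of "within 10%/20%" as percentage-point distance from the top option's vote share, and the function name score_alignment are illustrative assumptions; the abstract does not specify the authors' exact scoring procedure.

```python
# Hedged sketch of the alignment scoring described in the methods.
# Assumes each poll maps treatment options to their share of human votes (%).

def score_alignment(poll: dict[str, float], ai_choice: str) -> dict[str, bool]:
    """Score one AI answer against a human treatment poll."""
    top_share = max(poll.values())       # vote share of the most popular option
    ai_share = poll.get(ai_choice, 0.0)  # vote share of the option the AI chose
    return {
        "most_popular": ai_share == top_share,         # matches the modal human answer
        "within_10pct": top_share - ai_share <= 10.0,  # within 10% of the top option
        "within_20pct": top_share - ai_share <= 20.0,  # within 20% of the top option
    }

# Example: humans split 55/40/5 across three options; the AI picks option B.
poll = {"A": 55.0, "B": 40.0, "C": 5.0}
print(score_alignment(poll, "B"))
# {'most_popular': False, 'within_10pct': False, 'within_20pct': True}
```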
Results: In total, 8 clinical categories comprising 97 questions were analysed. ChatGPT 4 demonstrated the highest proportion of most popular responses (ChatGPT 4 68.0%, ChatGPT 3.5 40.2%, Bard 45.4%; P < 0.001), outperforming the other AI tools. AI tools performed worse on questions considered controversial (those on which human responses disagreed). Inter-tool agreement, evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. Bard) to 0.634 (ChatGPT 3.5 vs. Bard). However, AI tool responses varied widely, underscoring the need for consistency in real-world clinical applications.
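The reported inter-tool kappa values can in principle be reproduced from paired per-question answers. Below is a minimal sketch of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement; the answer lists are invented for illustration, since the study's per-question data are not given in the abstract.

```python
# Hedged sketch: Cohen's kappa over two tools' categorical answers.
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each tool's marginal answer frequencies
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical answers from two tools over five questions
chatgpt4 = ["A", "B", "A", "C", "B"]
bard     = ["A", "B", "C", "C", "A"]
print(f"kappa = {cohens_kappa(chatgpt4, bard):.3f}")  # kappa = 0.412
```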
Conclusions: While AI tools demonstrated potential utility in educational contexts, their integration into clinical decision-making warrants caution owing to inconsistent responses and deviations from peer consensus. Future research should focus on developing specialised clinical AI tools to maximise their utility in clinical decision-making.
Level of evidence: IV.
© 2023 The Authors.
Conflict of interest statement
All authors declare they have no conflicts of interest relating to this study.
Similar articles
- Evaluation of the Current Status of Artificial Intelligence for Endourology Patient Education: A Blind Comparison of ChatGPT and Google Bard Against Traditional Information Resources. J Endourol. 2024 Aug;38(8):843-851. doi: 10.1089/end.2023.0696. PMID: 38441078
- Chat Generative Pretrained Transformer (ChatGPT) and Bard: Artificial Intelligence Does not yet Provide Clinically Supported Answers for Hip and Knee Osteoarthritis. J Arthroplasty. 2024 May;39(5):1184-1190. doi: 10.1016/j.arth.2024.01.029. PMID: 38237878
- Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions. Adv Med Educ Pract. 2024 Sep 20;15:857-871. doi: 10.2147/AMEP.S479801. PMID: 39319062
- Can Artificial Intelligence Fool Residency Selection Committees? Analysis of Personal Statements by Real Applicants and Generative AI, a Randomized, Single-Blind Multicenter Study. JB JS Open Access. 2024 Oct 24;9(4):e24.00028. doi: 10.2106/JBJS.OA.24.00028. PMID: 39450246
- ChatGPT Performs at the Level of a Third-Year Orthopaedic Surgery Resident on the Orthopaedic In-Training Examination. JB JS Open Access. 2023 Dec 11;8(4):e23.00103. doi: 10.2106/JBJS.OA.23.00103. PMID: 38638869
Cited by
- Assessing the Current Limitations of Large Language Models in Advancing Health Care Education. JMIR Form Res. 2025 Jan 16;9:e51319. doi: 10.2196/51319. PMID: 39819585
- Evaluating DeepResearch and DeepThink in anterior cruciate ligament surgery patient education: ChatGPT-4o excels in comprehensiveness, DeepSeek R1 leads in clarity and readability of orthopaedic information. Knee Surg Sports Traumatol Arthrosc. 2025 Aug;33(8):3025-3031. doi: 10.1002/ksa.12711. PMID: 40450565
- Evaluating the quality and readability of ChatGPT-generated patient-facing medical information in rhinology. Eur Arch Otorhinolaryngol. 2025 Apr;282(4):1911-1920. doi: 10.1007/s00405-024-09180-0. PMID: 39724239
- Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems. Diagnostics (Basel). 2024 Jul 11;14(14):1491. doi: 10.3390/diagnostics14141491. PMID: 39061628
- Examining the Role of Large Language Models in Orthopedics: Systematic Review. J Med Internet Res. 2024 Nov 15;26:e59607. doi: 10.2196/59607. PMID: 39546795