Review

ChatGPT-4 Knows Its A B C D E but Cannot Cite Its Source

Diane Ghanem et al. JB JS Open Access. 2024 Sep 5;9(3):e24.00099. doi: 10.2106/JBJS.OA.24.00099. eCollection 2024 Jul-Sep.

Abstract

Introduction: The artificial intelligence language model Chat Generative Pretrained Transformer (ChatGPT) has shown potential as a reliable and accessible educational resource in orthopaedic surgery. Yet the accuracy of the references behind the information it provides remains unverified, which raises concerns about the integrity of medical content. This study examines the accuracy of the references provided by ChatGPT-4 concerning the Airway, Breathing, Circulation, Disability, Exposure (ABCDE) approach in trauma surgery.

Methods: Two independent reviewers critically assessed 30 ChatGPT-4-generated references supporting the well-established ABCDE approach to trauma protocol, grading each as 0 (nonexistent), 1 (inaccurate), or 2 (accurate). All discrepancies between the ChatGPT-4-generated references and their PubMed counterparts were carefully reviewed and marked in bold. Cohen's kappa coefficient was used to examine inter-reviewer agreement on the accuracy scores, descriptive statistics were used to summarize the mean reference accuracy scores, and one-way analysis of variance (ANOVA) was used to compare mean accuracy scores across the 5 ABCDE categories.
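For readers unfamiliar with the two statistics named above, the sketch below shows how they are typically computed in Python with scikit-learn and SciPy. The grades are hypothetical placeholders, not the study's data, and the code illustrates the named methods rather than reproducing the authors' analysis.

from sklearn.metrics import cohen_kappa_score
from scipy.stats import f_oneway

# Hypothetical grades (0 = nonexistent, 1 = inaccurate, 2 = accurate) for
# 30 references, 6 per ABCDE category, as assigned by two reviewers.
reviewer_1 = [2, 2, 1, 0, 2, 1,   # Airway
              2, 1, 1, 2, 0, 2,   # Breathing
              1, 2, 2, 1, 2, 0,   # Circulation
              2, 2, 1, 1, 0, 2,   # Disability
              1, 2, 2, 1, 2, 1]   # Exposure
reviewer_2 = [2, 2, 1, 0, 2, 2,
              2, 1, 1, 2, 0, 2,
              1, 2, 1, 1, 2, 0,
              2, 2, 1, 1, 0, 2,
              1, 2, 2, 2, 2, 1]

# Cohen's kappa: chance-corrected agreement between the two reviewers.
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.3f}")

# One-way ANOVA: do mean accuracy scores differ across the 5 categories?
groups = [reviewer_1[i:i + 6] for i in range(0, 30, 6)]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.3f}")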

Results: ChatGPT-4 had an average reference accuracy score of 66.7%. Of the 30 references, only 43.3% were accurate and deemed "true," while 56.7% were categorized as "false" (43.3% inaccurate and 13.3% nonexistent). The accuracy was consistent across the 5 trauma protocol categories, with no statistically significant difference (p = 0.437).

Discussion: With 57% of references being inaccurate or nonexistent, ChatGPT-4 has fallen short in providing reliable and reproducible references, a concerning finding for the safety of using ChatGPT-4 in professional medical decision making without thorough verification. Only if used cautiously and with cross-referencing can this language model serve as an adjunct learning tool that enhances comprehensiveness as well as knowledge rehearsal and manipulation.


Conflict of interest statement

Disclosure: The Disclosure of Potential Conflicts of Interest forms are provided with the online version of the article (http://links.lww.com/JBJSOA/A667).

Figures

Fig. 1 ChatGPT-generated answer (version 4.0) regarding the ABCDE approach to trauma protocol. ABCDE = Airway, Breathing, Circulation, Disability, Exposure; ChatGPT = Chat Generative Pretrained Transformer.

Fig. 2 ChatGPT-generated scientific references to support each of the 5 steps of the ABCDE approach to the trauma protocol. ABCDE = Airway, Breathing, Circulation, Disability, Exposure; ChatGPT = Chat Generative Pretrained Transformer.

Fig. 3 Pie chart showing the accuracy of the ChatGPT-generated references categorized as "true" or "false." ChatGPT = Chat Generative Pretrained Transformer.

