Review

ChatGPT-4 Knows Its A B C D E but Cannot Cite Its Source

Diane Ghanem et al. JB JS Open Access. 2024 Sep 5;9(3):e24.00099. doi: 10.2106/JBJS.OA.24.00099. eCollection 2024 Jul-Sep.

Abstract

Introduction: The artificial intelligence language model Chat Generative Pretrained Transformer (ChatGPT) has shown potential as a reliable and accessible educational resource in orthopaedic surgery. Yet the accuracy of the references behind the information it provides remains unverified, which raises concerns about the integrity of medical content. This study examines the accuracy of the references provided by ChatGPT-4 concerning the Airway, Breathing, Circulation, Disability, Exposure (ABCDE) approach in trauma surgery.

Methods: Two independent reviewers critically assessed 30 ChatGPT-4-generated references supporting the well-established ABCDE approach to trauma protocol, grading each as 0 (nonexistent), 1 (inaccurate), or 2 (accurate). All discrepancies between the ChatGPT-4-generated references and their PubMed counterparts were carefully reviewed and marked in bold. Cohen's kappa coefficient was used to examine inter-reviewer agreement on the accuracy scores, descriptive statistics were used to summarize the mean reference accuracy scores, and one-way analysis of variance (ANOVA) was used to compare mean accuracy scores across the 5 ABCDE categories.
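For readers unfamiliar with the two statistics named above, the sketch below shows how they are typically computed in Python with scikit-learn and SciPy. The grades are hypothetical placeholders, not the study's data, and the code illustrates the named methods rather than reproducing the authors' analysis.

from sklearn.metrics import cohen_kappa_score
from scipy.stats import f_oneway

# Hypothetical grades (0 = nonexistent, 1 = inaccurate, 2 = accurate) for
# 30 references, 6 per ABCDE category, as assigned by two reviewers.
reviewer_1 = [2, 2, 1, 0, 2, 1,   # Airway
              2, 1, 1, 2, 0, 2,   # Breathing
              1, 2, 2, 1, 2, 0,   # Circulation
              2, 2, 1, 1, 0, 2,   # Disability
              1, 2, 2, 1, 2, 1]   # Exposure
reviewer_2 = [2, 2, 1, 0, 2, 2,
              2, 1, 1, 2, 0, 2,
              1, 2, 1, 1, 2, 0,
              2, 2, 1, 1, 0, 2,
              1, 2, 2, 2, 2, 1]

# Cohen's kappa: chance-corrected agreement between the two reviewers.
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.3f}")

# One-way ANOVA: do mean accuracy scores differ across the 5 categories?
groups = [reviewer_1[i:i + 6] for i in range(0, 30, 6)]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.3f}")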

Results: ChatGPT-4 had an average reference accuracy score of 66.7%. Of the 30 references, only 43.3% were accurate and deemed "true," while 56.7% were categorized as "false" (43.3% inaccurate and 13.3% nonexistent). The accuracy was consistent across the 5 trauma protocol categories, with no statistically significant difference (p = 0.437).

Discussion: With 57% of references being inaccurate or nonexistent, ChatGPT-4 has fallen short in providing reliable and reproducible references, a concerning finding for the safety of using ChatGPT-4 in professional medical decision making without thorough verification. Only if used cautiously and with cross-referencing can this language model serve as an adjunct learning tool that enhances comprehensiveness as well as knowledge rehearsal and manipulation.


Conflict of interest statement

Disclosure: The Disclosure of Potential Conflicts of Interest forms are provided with the online version of the article (http://links.lww.com/JBJSOA/A667).

Figures

Fig. 1 ChatGPT-generated answer (version 4.0) regarding the ABCDE approach to trauma protocol. ABCDE = Airway, Breathing, Circulation, Disability, Exposure; ChatGPT = Chat Generative Pretrained Transformer.

Fig. 2 ChatGPT-generated scientific references to support each of the 5 steps of the ABCDE approach to the trauma protocol. ABCDE = Airway, Breathing, Circulation, Disability, Exposure; ChatGPT = Chat Generative Pretrained Transformer.

Fig. 3 Pie chart showing the accuracy of the ChatGPT-generated references categorized as "true" or "false." ChatGPT = Chat Generative Pretrained Transformer.

