ChatGPT Performs at the Level of a Third-Year Orthopaedic Surgery Resident on the Orthopaedic In-Training Examination

Diane Ghanem¹, Oscar Covarrubias², Micheal Raad¹, Dawn LaPorte¹, Babar Shafiq¹

Affiliations

¹ Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland.
² School of Medicine, The Johns Hopkins University, Baltimore, Maryland.

PMID: 38638869
PMCID: PMC11025881
DOI: 10.2106/JBJS.OA.23.00103

Review

Diane Ghanem et al. JB JS Open Access. 2023 Dec 11;8(4):e23.00103. doi: 10.2106/JBJS.OA.23.00103. eCollection 2023 Oct-Dec.

Abstract

Introduction: Publicly available AI language models such as ChatGPT have demonstrated utility in text generation and even problem-solving when provided with clear instructions. The aim of this study is to assess ChatGPT's performance on the Orthopaedic In-Training Examination (OITE).

Methods: All 213 OITE 2021 web-based questions were retrieved from the AAOS-ResStudy website (https://www.aaos.org/education/examinations/ResStudy). Two independent reviewers copied and pasted the questions and response options into ChatGPT Plus (version 4.0) and recorded the generated answers. All media-containing questions were flagged and carefully examined. Twelve OITE media-containing questions that relied purely on images (clinical pictures, radiographs, MRIs, CT scans) and could not be rationalized from the clinical presentation were excluded. Cohen's Kappa coefficient was used to examine the agreement of ChatGPT-generated responses between reviewers. Descriptive statistics were used to summarize the performance (% correct) of ChatGPT Plus. The 2021 norm table was used to compare ChatGPT Plus' performance on the OITE to national orthopaedic surgery residents in that same year.
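As an illustration of the agreement statistic named in the Methods, the following is a minimal sketch of Cohen's Kappa for two raters. It is not the authors' analysis code: the function name `cohens_kappa` and the two short answer lists are hypothetical, chosen only to show how observed agreement is corrected for chance agreement.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return Cohen's Kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters need the same non-zero number of items.")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters recorded the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical recorded answers (options A-D) for eight questions, not study data.
reviewer_1 = ["A", "B", "C", "D", "A", "B", "C", "D"]
reviewer_2 = ["A", "B", "C", "D", "A", "B", "C", "A"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"Cohen's Kappa: {kappa:.3f}")
```

In this toy example the reviewers agree on seven of eight items, and chance agreement is 0.25, so the statistic is (0.875 - 0.25) / (1 - 0.25).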

Results: A total of 201 questions were evaluated by ChatGPT Plus. Excellent agreement was observed between raters for the 201 ChatGPT-generated responses, with a Cohen's Kappa coefficient of 0.947. Of the 201 questions, 92 (45.8%) were media-containing. ChatGPT Plus had an overall score of 61.2% (123/201) and a score of 64.2% (70/109) on non-media questions. When compared to the performance of all national orthopaedic surgery residents in 2021, ChatGPT Plus performed at the level of an average PGY3.
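The percentages in the Results follow directly from the reported counts, which the short sketch below recomputes. The variable names and the `percent` helper are assumptions made for illustration. The media-question score is not reported in the abstract; it is derived here by subtraction and should be read as arithmetic on the stated counts, not as a published result.

```python
def percent(correct, total):
    """Return a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Counts taken from the Methods and Results.
retrieved = 213
excluded = 12
evaluated = retrieved - excluded          # questions entered into ChatGPT Plus
media = 92                                # media-containing questions
non_media = evaluated - media             # non-media questions
overall_correct = 123
non_media_correct = 70

# Derived by subtraction; not stated in the abstract.
media_correct = overall_correct - non_media_correct

print("Evaluated questions:", evaluated)
print("Media-containing share (%):", percent(media, evaluated))
print("Overall score (%):", percent(overall_correct, evaluated))
print("Non-media score (%):", percent(non_media_correct, non_media))
print("Derived media score (%):", percent(media_correct, media))
```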

Discussion: ChatGPT Plus is able to pass the OITE with an overall score of 61.2%, ranking at the level of a third-year orthopaedic surgery resident. It provided logical reasoning and justifications that may help residents improve their understanding of OITE cases and general orthopaedic principles. Further studies are still needed to examine the efficacy of such models and their impact on long-term learning and OITE/ABOS performance.

Figures

**Fig. 1**
An example of an excluded media-containing question that relies purely on the photographs and cannot be rationalized by ChatGPT Plus (version 4.0).

**Fig. 2**
A flow diagram outlining the selection of the 2021 OITE practice test questions available on the AAOS website. AAOS = American Academy of Orthopaedic Surgeons and OITE = Orthopaedic Surgery In-Training Examination.

**Fig. 3**
An example of a question prompt entry and correct response on ChatGPT Plus (version 4.0).

**Fig. 4**
An example of a radiograph-containing question prompt entry and correctly rationalized response on ChatGPT Plus (version 4.0).
