Cureus. 2025 Jan 16;17(1):e77550. doi: 10.7759/cureus.77550. eCollection 2025 Jan.

Evaluating the Performance of ChatGPT4.0 Versus ChatGPT3.5 on the Hand Surgery Self-Assessment Exam: A Comparative Analysis of Performance on Image-Based Questions


Kiera L Vrindten et al. Cureus.

Abstract

Hypothesis: ChatGPT, an artificial intelligence (AI) platform, has become an increasingly useful tool in medical education, particularly as a supplement to residents' preparation for certification exams. As the AI model inevitably progresses, there is a growing need to establish ChatGPT's accuracy in specialty knowledge. Our study assesses the performance of ChatGPT4.0 on self-assessment questions pertaining to hand surgery in comparison to the performance of its predecessor, ChatGPT3.5. A distinct feature of ChatGPT4.0 is its ability to interpret visual input, which ChatGPT3.5 cannot. We hypothesize that ChatGPT4.0 will perform better on image-based questions than ChatGPT3.5.

Methods: This study used 10 self-assessment exams (2004 to 2013) from the American Society for Surgery of the Hand (ASSH). Performance on image-based questions was compared between ChatGPT4.0 and ChatGPT3.5. The primary outcome was the total score, expressed as the proportion of answers correct. Secondary outcomes were the proportion of questions for which ChatGPT4.0 provided elaborations, the length of those elaborations, and the number of questions for which ChatGPT4.0 provided answers with confidence. Descriptive analysis, Student's t-test, and one-way ANOVA were used for data analysis.

Results: Of 455 image-based questions, there was no statistically significant difference in total score between ChatGPT4.0 and ChatGPT3.5: ChatGPT4.0 answered 137 (30.1%) questions correctly while ChatGPT3.5 answered 131 (28.7%) correctly (p = 0.805). Although there was no significant difference in the length or frequency of elaborations in relation to the proportion of correct answers between the two versions, ChatGPT4.0 did provide significantly longer explanations overall than ChatGPT3.5 (p < 0.05). Moreover, of the 455 image-based questions, ChatGPT4.0 provided significantly fewer answers with confidence than ChatGPT3.5 (p < 0.05). Among the responses in which ChatGPT4.0 expressed uncertainty, there was a significant difference by image type, with the highest uncertainty arising from question stems involving radiograph-based images (p < 0.001).

Summary points: Overall, there was no significant difference in performance between ChatGPT4.0 and ChatGPT3.5 when answering image-based questions on the ASSH self-assessment examinations. Notably, however, ChatGPT4.0 expressed more uncertainty in its answers. Further exploration of how AI-generated responses influence user behavior in clinical and educational settings will be crucial to optimizing the role of AI in healthcare.
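
For readers who want to see the kind of proportion comparison reported above worked through, the following is a minimal sketch in Python, assuming a two-proportion z-test via statsmodels as a stand-in; the abstract does not state which software or exact test produced p = 0.805, so this illustrates the arithmetic rather than the authors' actual analysis.

    # Minimal sketch (assumption: a two-proportion z-test is an acceptable
    # stand-in; the paper itself reports Student's t-test and one-way ANOVA).
    from statsmodels.stats.proportion import proportions_ztest

    correct = [137, 131]   # correct answers: ChatGPT4.0 vs. ChatGPT3.5
    total = [455, 455]     # image-based questions posed to each model

    z_stat, p_value = proportions_ztest(count=correct, nobs=total)
    print(f"ChatGPT4.0: {correct[0]/total[0]:.1%}, ChatGPT3.5: {correct[1]/total[1]:.1%}")
    print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # difference is not statistically significant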

Keywords: ai; certification; chatgpt; education; self-assessment.


Conflict of interest statement

Human subjects: All authors have confirmed that this study did not involve human participants or tissue. Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue. Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the following: Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work. Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work. Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.

Figures

Figure 1. Performance and elaboration length between both versions of ChatGPT
ChatGPT4.0 provided longer explanations than ChatGPT3.5 (p < 0.05, t stat = -7.73) for image-based questions only. ChatGPT3.5 data are from the study by Han et al. [10].

References

    1. ChatGPT - reshaping medical education and clinical management. Khan RA, Jawaid M, Khan AR, Sajjad M. Pak J Med Sci. 2023;39:605–607. - PMC - PubMed
    2. ChatGPT in surgical practice—a new kid on the block. Bhattacharya K, Bhattacharya AS, Bhattacharya N, Yagnik VD, Garg P, Kumar S. Indian J Surg. 2023;85:1346–1349.
    3. ChatGPT in orthopedics: a narrative review exploring the potential of artificial intelligence in orthopedic practice. Giorgino R, Alessandri-Bonetti M, Luca A, Migliorini F, Rossi N, Peretti GM, Mangiavini L. Front Surg. 2023;10:1284015. - PMC - PubMed
    4. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. Kung TH, Cheatham M, Medenilla A, et al. PLOS Digit Health. 2023;2:0. - PMC - PubMed
    5. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. JMIR Med Educ. 2023;9:0. - PMC - PubMed
