The performance of ChatGPT and ERNIE Bot in surgical resident examinations
- PMID: 40220627
- DOI: 10.1016/j.ijmedinf.2025.105906
Abstract
Study purpose: To assess the application of two large language models (LLMs), ChatGPT-4.0 and ERNIE Bot-4.0, to surgical resident examinations and to compare their performance with that of human residents.
Study design: A total of 596 questions, with 183,556 recorded responses, were first drawn from the Medical Vision World, an authoritative medical education platform in China. Chinese-language questions, both with and without prompts, were input into ChatGPT-4.0 and ERNIE Bot-4.0 to compare their performance on a Chinese question database. We then screened a further 210 surgical questions with detailed response data from 43 residents to compare the performance of the residents with that of the two LLMs.
Results: Correctness on the 596 questions did not differ significantly with or without prompts for either LLM (ChatGPT-4.0: 68.96 % without prompts vs. 71.14 % with prompts, p = 0.411; ERNIE Bot-4.0: 78.36 % without prompts vs. 78.86 % with prompts, p = 0.832), but ERNIE Bot-4.0 was more accurate than ChatGPT-4.0 (with prompts: p = 0.002; without prompts: p < 0.001). On the further 210 prompted questions, both LLMs significantly outperformed the residents; ERNIE Bot-4.0 in particular scored higher than 95 % of the 43 residents.
Conclusions: The performance of ERNIE Bot-4.0 was superior to that of ChatGPT-4.0 and that of residents on surgical resident examinations in a Chinese question database.
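The abstract's headline comparison (ERNIE Bot-4.0 vs. ChatGPT-4.0 without prompts, p < 0.001) can be reproduced with a standard two-proportion chi-square test. This is a sketch only: the correct-answer counts below are back-calculated from the reported percentages (68.96 % and 78.36 % of 596 questions), and the abstract does not state which exact test the authors used.

```python
from math import erfc, sqrt

def two_proportion_chi2(correct_a, n_a, correct_b, n_b):
    """Yates-corrected Pearson chi-square test (df = 1) for two proportions."""
    table = [[correct_a, n_a - correct_a],
             [correct_b, n_b - correct_b]]
    total = n_a + n_b
    col = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    row = [n_a, n_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            # Yates continuity correction for a 2x2 table
            chi2 += (abs(table[i][j] - expected) - 0.5) ** 2 / expected
    # For df = 1, the upper-tail p-value is erfc(sqrt(chi2 / 2))
    p = erfc(sqrt(chi2 / 2))
    return chi2, p

# Counts back-calculated from the reported percentages, 596 questions,
# no-prompt condition: ChatGPT-4.0 68.96% -> 411 correct;
# ERNIE Bot-4.0 78.36% -> 467 correct.
chi2, p = two_proportion_chi2(411, 596, 467, 596)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # p < 0.001, consistent with the abstract
```

The reconstructed p-value falls below 0.001, matching the significance level reported in the Results.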
Keywords: Artificial intelligence; ChatGPT; ERNIE Bot; Medical examination.
Copyright © 2025 Elsevier B.V. All rights reserved.
Conflict of interest statement
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Similar articles
- Comparing the performance of ChatGPT and ERNIE Bot in answering questions regarding liver cancer interventional radiology in Chinese and English contexts: A comparative study. Digit Health. 2025 Jan 23;11:20552076251315511. doi: 10.1177/20552076251315511. PMID: 39850627. Free PMC article.
- Comparative performance analysis of global and Chinese-domain large language models for myopia. Eye (Lond). 2025 Jul;39(10):2015-2022. doi: 10.1038/s41433-025-03775-5. PMID: 40223113.
- Application value of generative artificial intelligence in the field of stomatology. Hua Xi Kou Qiang Yi Xue Za Zhi. 2024 Dec 1;42(6):810-815. doi: 10.7518/hxkq.2024.2024144. PMID: 39610079. Free PMC article. Chinese, English.
- Utility of artificial intelligence-based large language models in ophthalmic care. Ophthalmic Physiol Opt. 2024 May;44(3):641-671. doi: 10.1111/opo.13284. PMID: 38404172. Review.
- ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024 Oct 18;100(1189):858-865. doi: 10.1093/postmj/qgae065. PMID: 38840505. Review.