Role of Artificial Intelligence in Surgical Training by Assessing GPT-4 and GPT-4o on the Japan Surgical Board Examination With Text-Only and Image-Accompanied Questions: Performance Evaluation Study
- PMID: 40737609
- PMCID: PMC12310146
- DOI: 10.2196/69313
Role of Artificial Intelligence in Surgical Training by Assessing GPT-4 and GPT-4o on the Japan Surgical Board Examination With Text-Only and Image-Accompanied Questions: Performance Evaluation Study
Abstract
Background: Artificial intelligence and large language models (LLMs)-particularly GPT-4 and GPT-4o-have demonstrated high correct-answer rates in medical examinations. GPT-4o has enhanced diagnostic capabilities, advanced image processing, and updated knowledge. Japanese surgeons face critical challenges, including a declining workforce, regional health care disparities, and work-hour-related challenges. Nonetheless, although LLMs could be beneficial in surgical education, no studies have yet assessed GPT-4o's surgical knowledge or its performance in the field of surgery.
Objective: This study aims to evaluate the potential of GPT-4 and GPT-4o in surgical education by using them to take the Japan Surgical Board Examination (JSBE), which includes both textual questions and medical images-such as surgical and computed tomography scans-to comprehensively assess their surgical knowledge.
Methods: We used 297 multiple-choice questions from the 2021-2023 JSBEs. The questions were in Japanese, and 104 of them included images. First, the GPT-4 and GPT-4o responses to only the textual questions were collected via OpenAI's application programming interface to evaluate their correct-answer rate. Subsequently, the correct-answer rate of their responses to questions that included images was assessed by inputting both text and images.
Results: The overall correct-answer rates of GPT-4o and GPT-4 for the text-only questions were 78% (231/297) and 55% (163/297), respectively, with GPT-4o outperforming GPT-4 by 23% (P=<.01). By contrast, there was no significant improvement in the correct-answer rate for questions that included images compared with the results for the text-only questions.
Conclusions: GPT-4o outperformed GPT-4 on the JSBE. However, the results of the LLMs were lower than those of the examinees. Despite the capabilities of LLMs, image recognition remains a challenge for them, and their clinical application requires caution owing to the potential inaccuracy of their results.
Keywords: ChatGPT; Japan Surgical Board Examination; LLM; Medical Licensing Examination; artificial intelligence; diagnostic imaging; large language models; surgical education.
© Hiroki Maruyama, Yoshitaka Toyama, Kentaro Takanami, Kei Takase, Takashi Kamei. Originally published in JMIR Medical Education (https://mededu.jmir.org).
Conflict of interest statement
Figures
Similar articles
-
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592. JMIR Form Res. 2024. PMID: 39714199 Free PMC article.
-
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910. J Med Internet Res. 2025. PMID: 40392576 Free PMC article.
-
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807. J Med Internet Res. 2024. PMID: 39052324 Free PMC article.
-
Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.J Med Internet Res. 2025 Feb 7;27:e65146. doi: 10.2196/65146. J Med Internet Res. 2025. PMID: 39919278 Free PMC article.
-
Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis.BMC Med Educ. 2024 Sep 16;24(1):1013. doi: 10.1186/s12909-024-05944-8. BMC Med Educ. 2024. PMID: 39285377 Free PMC article.
References
-
- Overview of statistics on doctors, dentists [Article in Japanese] Ministry of Health Labour and Welfare. 2024. [16-07-2025]. https://www.mhlw.go.jp/toukei/saikin/hw/ishi/22/index.html URL. Accessed.
-
- Work style reform for doctors [Article in Japanese] Ministry of Health, Labour and Welfare. 2024. [16-07-2025]. https://www.mhlw.go.jp/content/10800000/001129457.pdf URL. Accessed.
-
- ChatGPT. Open AI. 2024. [16-07-2025]. https://openai.com/chatgpt/ URL. Accessed.
MeSH terms
LinkOut - more resources
Full Text Sources