Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations
- PMID: 37581444
- DOI: 10.1227/neu.0000000000002632
Abstract
Background and objectives: Interest surrounding generative large language models (LLMs) has rapidly grown. Although ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT or its successor GPT-4 on specialized examinations and the factors affecting accuracy remain unclear. This study aims to assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.
Methods: The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single-best-answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.
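As a rough, hypothetical illustration of the analysis described above (not the authors' code or data), the sketch below runs a χ² test, a Fisher exact test, and a univariable logistic regression on question-level accuracy data using SciPy and statsmodels. The 2x2 counts are back-calculated from the reported scores; the word-count data are simulated.

```python
# Hypothetical sketch of the statistical comparisons described in Methods.
# Counts are back-calculated from reported scores; word counts are simulated.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact
import statsmodels.api as sm

# 2x2 table of correct/incorrect answers on the 500-question exam.
table = np.array([[417, 83],    # GPT-4: correct, incorrect
                  [367, 133]])  # ChatGPT (GPT-3.5): correct, incorrect

chi2, p_chi2, dof, _ = chi2_contingency(table)  # chi-squared test of independence
odds_ratio, p_fisher = fisher_exact(table)      # Fisher exact test
print(f"chi2 p={p_chi2:.3f}, Fisher p={p_fisher:.3f}")

# Univariable logistic regression: correctness ~ word count (per 10 words).
rng = np.random.default_rng(0)
word_count = rng.integers(20, 200, size=500)   # hypothetical question lengths
correct = rng.binomial(1, 0.73, size=500)      # hypothetical outcomes
X = sm.add_constant(word_count / 10.0)         # scale so the OR is per +10 words
fit = sm.Logit(correct, X).fit(disp=False)
print("OR per +10 words:", np.exp(fit.params[1]))
```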
Results: ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although scores of ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 correctly answered every question that ChatGPT answered correctly, as well as 37.6% (50/133) of the questions that ChatGPT answered incorrectly. Across the 12 question categories, GPT-4 significantly outperformed users in every category, performed comparably with ChatGPT in 3 (functional, other general, and spine), and outperformed both users and ChatGPT on tumor questions. Increased word count (odds ratio = 0.89 of answering correctly per 10 additional words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .005). Multimodal input was not available at the time of this study; on questions containing images, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly from contextual clues alone.
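The reported scores and 95% CIs can be approximately reproduced from the raw counts (367/500 and 417/500, back-calculated from the percentages above). A minimal sketch follows, assuming an exact Clopper-Pearson binomial interval; the paper does not state which interval was used.

```python
# Hypothetical check of the reported scores and 95% CIs (not the authors' code).
from statsmodels.stats.proportion import proportion_confint

n_questions = 500
for model, n_correct in [("ChatGPT (GPT-3.5)", 367), ("GPT-4", 417)]:
    score = n_correct / n_questions
    # Exact (Clopper-Pearson) interval; an assumption made for this sketch.
    lo, hi = proportion_confint(n_correct, n_questions, alpha=0.05, method="beta")
    print(f"{model}: {score:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```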
Conclusion: LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
Copyright © Congress of Neurological Surgeons 2023. All rights reserved.
Similar articles
- Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. Neurosurgery. 2023 Nov 1;93(5):1090-1098. doi: 10.1227/neu.0000000000002551. Epub 2023 Jun 12. PMID: 37306460
- Accuracy and quality of ChatGPT-4o and Google Gemini performance on image-based neurosurgery board questions. Neurosurg Rev. 2025 Mar 25;48(1):320. doi: 10.1007/s10143-025-03472-7. PMID: 40131528
- GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. World Neurosurg. 2023 Nov;179:e160-e165. doi: 10.1016/j.wneu.2023.08.042. Epub 2023 Aug 18. PMID: 37597659
- Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models. J Cardiothorac Vasc Anesth. 2024 May;38(5):1251-1259. doi: 10.1053/j.jvca.2024.01.032. Epub 2024 Feb 1. PMID: 38423884. Review.
- Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination. JB JS Open Access. 2023 Sep 8;8(3):e23.00056. doi: 10.2106/JBJS.OA.23.00056. PMID: 37693092. Free PMC article. Review.
Cited by
- Is ChatGPT 3.5 smarter than Otolaryngology trainees? A comparison study of board style exam questions. PLoS One. 2024 Sep 26;19(9):e0306233. doi: 10.1371/journal.pone.0306233. PMID: 39325705. Free PMC article.
- Inadequate Performance of ChatGPT on Orthopedic Board-Style Written Exams. Cureus. 2024 Jun 18;16(6):e62643. doi: 10.7759/cureus.62643. PMID: 39036109. Free PMC article.
- Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom's Taxonomy. Adv Med Educ Pract. 2024 May 10;15:393-400. doi: 10.2147/AMEP.S457408. PMID: 38751805. Free PMC article.
- Educational Limitations of ChatGPT in Neurosurgery Board Preparation. Cureus. 2024 Apr 20;16(4):e58639. doi: 10.7759/cureus.58639. PMID: 38770467. Free PMC article.
- Novel Evaluation Metric and Quantified Performance of ChatGPT-4 Patient Management Simulations for Early Clinical Education: Experimental Study. JMIR Form Res. 2025 Feb 27;9:e66478. doi: 10.2196/66478. PMID: 40013991. Free PMC article.