Comparison of performance of artificial intelligence tools in answering emergency medicine question pool: ChatGPT 4.0, Google Gemini and Microsoft Copilot
- PMID: 40290213
- PMCID: PMC12022595
- DOI: 10.12669/pjms.41.4.11178
Abstract
Objective: The use of artificial intelligence tools built on different software architectures for clinical and educational purposes in medicine has attracted considerable interest recently. In this study, we compared the answers given by three artificial intelligence chatbots to an emergency medicine question pool drawn from the Turkish National Medical Specialization Exam. We also investigated how question content and form influenced the answers by classifying the questions and examining the question stems.
Methods: Emergency medicine questions from the Medical Specialization Exams administered between 2015 and 2020 were collected. The questions were posed to three artificial intelligence models: ChatGPT-4, Gemini, and Copilot. Question length, question type, and the topics of incorrectly answered questions were recorded.
Results: The most successful chatbot by total score was Microsoft Copilot (7.8% error rate), while the least successful was Google Gemini (22.9% error rate) (p<0.001). Notably, all chatbots had their highest error rates on questions about trauma and surgical approaches, and all made mistakes on burns and pediatrics questions. Higher error rates on questions whose stems contained the word "probability" further indicated that question style affected the answers given.
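The abstract does not state which statistical test produced the reported p-value; a chi-square test of independence over per-chatbot correct/incorrect counts is one common way to compare such error rates. The sketch below illustrates this with placeholder counts (not the study's data) using scipy.

```python
# Minimal sketch: comparing chatbot error rates with a chi-square test.
# The counts below are hypothetical placeholders, not the study's results.
from scipy.stats import chi2_contingency

counts = {
    "ChatGPT-4": (85, 15),   # (correct, incorrect)
    "Copilot":   (92, 8),
    "Gemini":    (77, 23),
}

# Contingency table: rows = chatbots, columns = [correct, incorrect]
table = [list(v) for v in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)

for name, (ok, err) in counts.items():
    print(f"{name}: error rate {err / (ok + err):.1%}")
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```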
Conclusions: Although chatbots show promising success in identifying the correct answer, we think learners should not treat them as a primary source for exam preparation, but rather as a useful auxiliary tool to support their learning process.
Keywords: Artificial Intelligence; ChatGPT; Copilot; Emergency medicine; Gemini; Medical education.
Copyright: © Pakistan Journal of Medical Sciences.
Conflict of interest statement
Conflict of interest: None.