Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot
- PMID: 40350555
- DOI: 10.1002/ase.70044
Abstract
Integrating artificial intelligence, particularly large language models (LLMs), into medical education marks a significant shift in how medical knowledge is accessed, processed, and evaluated. The objective of this study was to conduct a comprehensive comparison of the performance of advanced LLM chatbots across topics in a medical embryology course. Two hundred United States Medical Licensing Examination (USMLE)-style multiple-choice questions were selected from the course exam database and distributed across 20 topics. GPT-4o, Claude, Gemini, Copilot, and GPT-3.5 each answered the assessment items in three attempts, and their responses were evaluated. Statistical analyses included intraclass correlation coefficients (ICCs) for reliability, one-way and two-way mixed ANOVAs for performance comparisons, and post hoc analyses; effect sizes were calculated using Cohen's f and eta-squared (η²). On average, the chatbots correctly answered 78.7% ± 15.1% of the questions. GPT-4o and Claude performed best, correctly answering 89.7% and 87.5% of the questions, respectively, with no statistically significant difference between them (p = 0.238). The performance of the other chatbots was significantly lower (p < 0.01): Copilot (82.5%), Gemini (74.8%), and GPT-3.5 (59.0%). Test-retest reliability analysis showed good reliability for GPT-4o (ICC = 0.803), Claude (ICC = 0.865), and Gemini (ICC = 0.876), and moderate reliability for Copilot and GPT-3.5. These findings suggest that models such as GPT-4o and Claude show promise for providing tailored embryology instruction, although instructor verification remains essential.
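As a rough illustration only (not the authors' code), the sketch below shows how an analysis of this kind could be run in Python with pandas and pingouin, assuming a hypothetical long-format table chatbot_embryology_scores.csv with columns model, topic, attempt, and accuracy (per-topic accuracy for each of the three attempts by each chatbot).

```python
# Illustrative sketch of the reported analysis pipeline, under assumed data layout:
# one row per (model, topic, attempt) with a numeric 'accuracy' column.
import pandas as pd
import pingouin as pg  # assumed available; pip install pingouin

df = pd.read_csv("chatbot_embryology_scores.csv")  # hypothetical file name

# Test-retest reliability: ICC across the three attempts, computed per model
for model, sub in df.groupby("model"):
    icc = pg.intraclass_corr(data=sub, targets="topic",
                             raters="attempt", ratings="accuracy")
    print(model, icc.loc[icc["Type"] == "ICC2", "ICC"].values)

# One-way ANOVA comparing models; the output includes partial eta-squared (np2)
aov = pg.anova(data=df, dv="accuracy", between="model", detailed=True)
print(aov)

# Post hoc pairwise comparisons between models (Tukey HSD)
posthoc = pg.pairwise_tukey(data=df, dv="accuracy", between="model")
print(posthoc)
```

This is only one plausible implementation; the study also reports a two-way mixed ANOVA and Cohen's f, which would require the within-subject (attempt) factor and additional effect-size calculations not shown here.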
Keywords: ChatGPT; Claude; Copilot; Gemini; artificial intelligence; embryology; large language models; medical education.
© 2025 American Association for Anatomy.