Comparative Study

Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot

Olena Bolgova et al.

Anat Sci Educ. 2025 Jul;18(7):718-726. doi: 10.1002/ase.70044. Epub 2025 May 11.

Abstract

Integrating artificial intelligence, particularly large language models (LLMs), into medical education represents a significant new step in how medical knowledge is accessed, processed, and evaluated. The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots in different topics of medical embryology courses. Two hundred United States Medical Licensing Examination (USMLE)-style multiple-choice questions were selected from the course exam database and distributed across 20 topics. The results of 3 attempts by GPT-4o, Claude, Gemini, Copilot, and GPT-3.5 to answer the assessment items were evaluated. Statistical analyses included intraclass correlation coefficients for reliability, one-way and two-way mixed ANOVAs for performance comparisons, and post hoc analyses. Effect sizes were calculated using Cohen's f and eta-squared (η2). On average, the selected chatbots correctly answered 78.7% ± 15.1% of the questions. GPT-4o and Claude performed best, correctly answering 89.7% and 87.5% of the questions, respectively, without a statistical difference in their performance (p = 0.238). The performance of other chatbots was significantly lower (p < 0.01): Copilot (82.5%), Gemini (74.8%), and GPT-3.5 (59.0%). Test-retest reliability analysis showed good reliability for GPT-4o (ICC = 0.803), Claude (ICC = 0.865), and Gemini (ICC = 0.876), with moderate reliability for Copilot and GPT-3.5. This study suggests that AI models like GPT-4o and Claude show promise for providing tailored embryology instruction, though instructor verification remains essential.
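The abstract names several standard statistics: a one-way ANOVA across chatbots with eta-squared (η²) as the effect size, and test-retest intraclass correlation coefficients across repeated attempts. The snippet below is a minimal sketch of how analyses of this kind can be computed with NumPy and SciPy, under stated assumptions: the per-topic accuracy scores are simulated around the group means reported above (the study's raw data are not part of this record), and ICC(3,1) is shown as one common test-retest formulation, since the abstract does not specify which ICC form was used.

```python
# Sketch of the statistics named in the abstract: one-way ANOVA with
# eta-squared, and a test-retest ICC(3,1). All scores below are
# SIMULATED for illustration; only the group means come from the abstract.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-topic accuracy (%) for each chatbot: 20 topics per model.
models = ["GPT-4o", "Claude", "Copilot", "Gemini", "GPT-3.5"]
means = [89.7, 87.5, 82.5, 74.8, 59.0]  # group means reported in the abstract
scores = {m: np.clip(rng.normal(mu, 10.0, size=20), 0.0, 100.0)
          for m, mu in zip(models, means)}

# One-way ANOVA across models; eta-squared = SS_between / SS_total.
groups = list(scores.values())
f_stat, p_val = stats.f_oneway(*groups)
grand = np.concatenate(groups)
ss_between = sum(len(g) * (g.mean() - grand.mean()) ** 2 for g in groups)
ss_total = ((grand - grand.mean()) ** 2).sum()
eta_sq = ss_between / ss_total
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4g}, eta^2 = {eta_sq:.3f}")

# Test-retest reliability for one model: 20 topics (rows) x 3 attempts (cols).
# ICC(3,1) = (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error),
# with mean squares taken from a two-way ANOVA decomposition.
attempts = np.column_stack(
    [scores["GPT-4o"] + rng.normal(0.0, 3.0, size=20) for _ in range(3)]
)
n, k = attempts.shape
row_means = attempts.mean(axis=1, keepdims=True)   # per-topic means
col_means = attempts.mean(axis=0, keepdims=True)   # per-attempt means
grand_mean = attempts.mean()
ms_rows = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
ss_err = ((attempts - row_means - col_means + grand_mean) ** 2).sum()
ms_err = ss_err / ((n - 1) * (k - 1))
icc_3_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
print(f"Test-retest ICC(3,1): {icc_3_1:.3f}")
```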

Keywords: ChatGPT; Claude; Copilot; Gemini; artificial intelligence; embryology; large language models; medical education.


