Comparative Study

Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot

Olena Bolgova et al.

Anat Sci Educ. 2025 Jul;18(7):718-726. doi: 10.1002/ase.70044. Epub 2025 May 11.

Abstract

Integrating artificial intelligence, particularly large language models (LLMs), into medical education represents a significant new step in how medical knowledge is accessed, processed, and evaluated. The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots in different topics of medical embryology courses. Two hundred United States Medical Licensing Examination (USMLE)-style multiple-choice questions were selected from the course exam database and distributed across 20 topics. The results of 3 attempts by GPT-4o, Claude, Gemini, Copilot, and GPT-3.5 to answer the assessment items were evaluated. Statistical analyses included intraclass correlation coefficients for reliability, one-way and two-way mixed ANOVAs for performance comparisons, and post hoc analyses. Effect sizes were calculated using Cohen's f and eta-squared (η2). On average, the selected chatbots correctly answered 78.7% ± 15.1% of the questions. GPT-4o and Claude performed best, correctly answering 89.7% and 87.5% of the questions, respectively, without a statistical difference in their performance (p = 0.238). The performance of other chatbots was significantly lower (p < 0.01): Copilot (82.5%), Gemini (74.8%), and GPT-3.5 (59.0%). Test-retest reliability analysis showed good reliability for GPT-4o (ICC = 0.803), Claude (ICC = 0.865), and Gemini (ICC = 0.876), with moderate reliability for Copilot and GPT-3.5. This study suggests that AI models like GPT-4o and Claude show promise for providing tailored embryology instruction, though instructor verification remains essential.
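The abstract names several standard statistics: a one-way ANOVA across chatbots with eta-squared (η²) as the effect size, and test-retest intraclass correlation coefficients across repeated attempts. The snippet below is a minimal sketch of how analyses of this kind can be computed with NumPy and SciPy, under stated assumptions: the per-topic accuracy scores are simulated around the group means reported above (the study's raw data are not part of this record), and ICC(3,1) is shown as one common test-retest formulation, since the abstract does not specify which ICC form was used.

```python
# Sketch of the statistics named in the abstract: one-way ANOVA with
# eta-squared, and a test-retest ICC(3,1). All scores below are
# SIMULATED for illustration; only the group means come from the abstract.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-topic accuracy (%) for each chatbot: 20 topics per model.
models = ["GPT-4o", "Claude", "Copilot", "Gemini", "GPT-3.5"]
means = [89.7, 87.5, 82.5, 74.8, 59.0]  # group means reported in the abstract
scores = {m: np.clip(rng.normal(mu, 10.0, size=20), 0.0, 100.0)
          for m, mu in zip(models, means)}

# One-way ANOVA across models; eta-squared = SS_between / SS_total.
groups = list(scores.values())
f_stat, p_val = stats.f_oneway(*groups)
grand = np.concatenate(groups)
ss_between = sum(len(g) * (g.mean() - grand.mean()) ** 2 for g in groups)
ss_total = ((grand - grand.mean()) ** 2).sum()
eta_sq = ss_between / ss_total
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4g}, eta^2 = {eta_sq:.3f}")

# Test-retest reliability for one model: 20 topics (rows) x 3 attempts (cols).
# ICC(3,1) = (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error),
# with mean squares taken from a two-way ANOVA decomposition.
attempts = np.column_stack(
    [scores["GPT-4o"] + rng.normal(0.0, 3.0, size=20) for _ in range(3)]
)
n, k = attempts.shape
row_means = attempts.mean(axis=1, keepdims=True)   # per-topic means
col_means = attempts.mean(axis=0, keepdims=True)   # per-attempt means
grand_mean = attempts.mean()
ms_rows = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
ss_err = ((attempts - row_means - col_means + grand_mean) ** 2).sum()
ms_err = ss_err / ((n - 1) * (k - 1))
icc_3_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
print(f"Test-retest ICC(3,1): {icc_3_1:.3f}")
```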

Keywords: ChatGPT; Claude; Copilot; Gemini; artificial intelligence; embryology; large language models; medical education.


