Large language models in medical education: a comparative cross-platform evaluation in answering histological questions

Volodymyr Mavrych et al. Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.

Comparative Study

Abstract

Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in the basic medical sciences remains incompletely characterized. Medical histology, which requires both factual knowledge and interpretative skills, provides a unique domain for evaluating AI capabilities in medical education. This study aimed to evaluate and compare the performance of five current LLMs (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1) in answering medical histology multiple-choice questions (MCQs). This cross-sectional comparative study used 200 USMLE-style histology MCQs spanning 20 topics. Each LLM completed all questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (intraclass correlation coefficient, ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests and two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (mean 91.1%, SD 7.2). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4.1 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (p > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4.1 (ICC = 0.882). Complete accuracy and reproducibility (100%) were observed in Histological Methods, Blood and Hemopoiesis, and the Circulatory System, while Muscle Tissue (76.0%) and the Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histology MCQs, markedly exceeding the performance LLMs have shown in other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific weaknesses and reliability concerns indicate a continued need for human expertise. These findings reflect rapid AI advancement and identify histology as particularly suitable for AI-assisted medical education. Clinical trial number: not applicable, as this study did not involve medicinal products or therapeutic interventions.
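For readers who want a concrete picture of the analysis described above, the sketch below shows one way the reported metrics could be computed in Python: per-system accuracy over three attempts, test-retest ICC, and a one-way ANOVA with Tukey's post-hoc test. This is a minimal illustration, not the authors' code: the correctness matrix is simulated, the library choices (pingouin, scipy, statsmodels) are assumptions, and the ICC form (ICC3) is a guess since the abstract does not state which was used.

```python
# Minimal sketch (not the authors' pipeline): accuracy, test-retest ICC,
# and one-way ANOVA + Tukey HSD across five LLMs, as the abstract describes.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
systems = ["GPT-4.1", "Claude 3.7", "Gemini 2.0", "Copilot", "DeepSeek R1"]
n_questions, n_attempts = 200, 3

# Simulated 0/1 correctness per question x attempt; a stand-in for real grading.
scores = {s: rng.binomial(1, 0.91, size=(n_questions, n_attempts)) for s in systems}

# Overall accuracy per system: mean over all questions and attempts.
for s in systems:
    print(f"{s}: {scores[s].mean():.1%}")

# Test-retest reliability: ICC with questions as targets and attempts as raters.
# ICC3 (two-way mixed, consistency) is an assumption on our part.
for s in systems:
    df = pd.DataFrame({
        "question": np.repeat(np.arange(n_questions), n_attempts),
        "attempt": np.tile(np.arange(n_attempts), n_questions),
        "correct": scores[s].ravel(),
    })
    icc = pg.intraclass_corr(data=df, targets="question",
                             raters="attempt", ratings="correct")
    print(s, float(icc.loc[icc["Type"] == "ICC3", "ICC"].iloc[0]))

# One-way ANOVA across systems on per-attempt accuracy, then Tukey's HSD.
per_attempt = {s: scores[s].mean(axis=0) for s in systems}  # 3 values per system
print(f_oneway(*per_attempt.values()))
values = np.concatenate([per_attempt[s] for s in systems])
groups = np.repeat(systems, n_attempts)
print(pairwise_tukeyhsd(values, groups))
```

With only three attempts per system the ANOVA groups are small, which is consistent with the abstract's finding of no significant between-system differences; the two-way mixed ANOVA for system-topic interactions would additionally require a per-topic breakdown of the same correctness data.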

Keywords: ChatGPT; Claude; Copilot; DeepSeek; Gemini; large language models; artificial intelligence; histology; medical education.

Conflict of interest statement

No potential conflict of interest was reported by the author(s).

Figures

Figure 1. Performance comparison of large language models in medical histology MCQs. Bars represent mean accuracy percentages, and error bars indicate standard deviations. Sample size = 200 questions, with three attempts per question.

Figure 2. Topic-specific performance heatmap of LLMs in medical histology. The visualization displays accuracy percentages for each combination of histological topic (vertical axis) and LLM system (horizontal axis), with a “mean” column showing average performance across all systems.
