Large language models for generating medical examinations: systematic review
- PMID: 38553693
- PMCID: PMC10981304
- DOI: 10.1186/s12909-024-05239-y
Abstract
Background: Writing multiple-choice questions (MCQs) for medical exams is challenging: it requires extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs.
Methods: The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Studies were excluded if they were not in English, fell outside the year range, or did not focus on AI-generated multiple-choice questions. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool.
Results: Overall, eight studies published between April 2023 and October 2023 were included. Six studies used Chat-GPT 3.5, while two employed GPT 4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models. Another study compared LLM-generated questions with those written by humans. All studies presented faulty questions that were deemed inappropriate for medical exams. Some questions required additional modifications in order to qualify.
Conclusions: LLMs can be used to write MCQs for medical examinations. However, their limitations cannot be ignored. Further study in this field is essential and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations. Two studies were at high risk of bias. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
Keywords: Artificial intelligence; Generative pre-trained transformer; Large language models; Medical education; Medical examination; Multiple choice questions.
© 2024. The Author(s).
Conflict of interest statement
The authors declare no competing interests.