BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.

Large language models for generating medical examinations: systematic review

Yaara Artsi et al.

Abstract

Background: Writing multiple choice questions (MCQs) for medical exams is challenging: it requires extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs.

Methods: The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the year range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool.

Results: Overall, eight studies published between April 2023 and October 2023 were included. Six studies used Chat-GPT 3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate their validity. One study conducted a comparative analysis of different models, and another compared LLM-generated questions with those written by humans. All studies presented some faulty questions that were deemed inappropriate for medical exams, and some questions required additional modifications in order to qualify.
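The included studies prompted general-purpose chat models to draft exam questions; their exact prompts and pipelines are not reproduced here. As a rough, hypothetical illustration of that kind of workflow, the sketch below assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY environment variable. The model name, prompt wording, and draft_mcq helper are illustrative assumptions, not taken from any of the reviewed studies, and, as the review's findings on faulty questions underscore, any generated item would still need review by a medical educator.

# Hypothetical sketch: prompting a chat model to draft one MCQ.
# Model name and prompt wording are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write one multiple-choice question for a medical licensing exam on the "
    "topic of '{topic}'. Provide a clinical vignette stem, five answer "
    "options labelled A-E, indicate the single correct answer, and give a "
    "brief explanation."
)

def draft_mcq(topic: str, model: str = "gpt-4o") -> str:
    """Return the raw text of one LLM-drafted MCQ for the given topic."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # The output is a draft only; a medical educator must vet it before use.
    print(draft_mcq("community-acquired pneumonia"))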

Conclusions: LLMs can be used to write MCQs for medical examinations, but their limitations cannot be ignored. Further study in this field is essential, and more conclusive evidence is needed; until then, LLMs may serve as a supplementary tool for writing medical examinations. Two studies were at high risk of bias. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

Keywords: Artificial intelligence; Generative pre-trained transformer; Large language models; Medical education; Medical examination; Multiple choice questions.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Flow diagram of the search and inclusion process, based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, November 2023.

Fig. 2. Risk of bias and applicability judgments, evaluated using the tailored QUADAS-2 tool, November 2023.

Fig. 3. Illustration of multiple-choice question (MCQ) generation and summary of preliminary results. Upper-row images were created using Chat-GPT 4 and DALL·E to illustrate the MCQ generation process via a large language model; bottom-row images showcase preliminary data results, November 2023.

