Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis

Hui Zong et al. J Med Internet Res. 2024 Dec 27;26:e66114. doi: 10.2196/66114.

Abstract

Background: Large language models (LLMs) are increasingly integrated into medical education, with transformative potential for learning and assessment. However, their performance across diverse medical exams globally has remained underexplored.

Objective: This study aims to introduce MedExamLLM, a comprehensive platform designed to systematically evaluate the performance of LLMs on medical exams worldwide. Specifically, the platform seeks to (1) compile and curate performance data for diverse LLMs on worldwide medical exams; (2) analyze trends and disparities in LLM capabilities across geographic regions, languages, and contexts; and (3) provide a resource for researchers, educators, and developers to explore and advance the integration of artificial intelligence in medical education.

Methods: A systematic search was conducted on April 25, 2024, in the PubMed database to identify relevant publications. Inclusion criteria encompassed peer-reviewed, English-language, original research articles that evaluated at least one LLM on a medical exam. Exclusion criteria included review articles, non-English publications, preprints, and studies without relevant data on LLM performance. Candidate publications were screened independently by 2 researchers to ensure accuracy and reliability. Data, including exam information, data processing details, model performance, data availability, and references, were manually curated, standardized, and organized. The curated data were integrated into the MedExamLLM platform, which visualizes and analyzes LLM performance across geographic, linguistic, and exam characteristics. The web platform was developed with a focus on accessibility, interactivity, and scalability to support continuous data updates and user engagement.
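
To make the retrieval step concrete, the sketch below runs a PubMed query through the NCBI E-utilities esearch endpoint, capped at the April 25, 2024 search date reported above. The query string is a hypothetical reconstruction for illustration only; the study's exact search expression is not reproduced here.

    # Minimal sketch of the PubMed retrieval step via NCBI E-utilities (esearch).
    import requests

    ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    params = {
        "db": "pubmed",
        # Hypothetical query; the study's published search expression may differ.
        "term": '("large language model" OR ChatGPT OR GPT) AND "medical exam"',
        "datetype": "pdat",
        "mindate": "2000/01/01",  # E-utilities requires mindate and maxdate together
        "maxdate": "2024/04/25",  # search date reported in the Methods
        "retmax": 500,
        "retmode": "json",
    }

    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    result = resp.json()["esearchresult"]
    print(f'{result["count"]} hits; {len(result["idlist"])} PMIDs retrieved for screening')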

Results: A total of 193 articles were included in the final analysis. MedExamLLM comprised information on 16 LLMs across 198 medical exams conducted in 28 countries and 15 languages from 2009 to 2023. The United States accounted for the highest number of medical exams and related publications, and English was the dominant exam language. The Generative Pretrained Transformer (GPT) series models, especially GPT-4, demonstrated superior performance, achieving pass rates significantly higher than those of other LLMs. The analysis revealed significant variability in LLM capabilities across geographic and linguistic contexts.
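
As a minimal sketch of how such pass-rate comparisons can be aggregated, the snippet below groups evaluation records by model, country, and language. The flat record schema (model, country, language, passed) is a hypothetical simplification for illustration, not the actual MedExamLLM data model.

    # Sketch of leaderboard-style pass-rate aggregation over curated records.
    import pandas as pd

    # Hypothetical rows; real entries would carry exam metadata and references.
    records = pd.DataFrame([
        {"model": "GPT-4",   "country": "United States", "language": "English",  "passed": True},
        {"model": "GPT-3.5", "country": "United States", "language": "English",  "passed": False},
        {"model": "GPT-4",   "country": "Japan",         "language": "Japanese", "passed": True},
        {"model": "Llama 2", "country": "Japan",         "language": "Japanese", "passed": False},
    ])

    # The mean of a boolean column is the pass rate within each group.
    by_model = records.groupby("model")["passed"].mean()
    by_country = records.groupby("country")["passed"].mean()
    by_language = records.groupby("language")["passed"].mean()
    print(by_model, by_country, by_language, sep="\n\n")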

Conclusions: MedExamLLM is an open-source, freely accessible online platform that provides comprehensive performance evaluations and synthesized evidence on LLMs across medical exams worldwide. It serves as a valuable resource for educators, researchers, and developers in clinical medicine and artificial intelligence, offering insights to support the integration of artificial intelligence into medical education. Limitations include potential biases in the data source and the exclusion of non-English literature. Future research should address these gaps and explore methods to enhance LLM performance in diverse contexts.

Keywords: AI; ChatGPT; LLMs; artificial intelligence; generative pretrained transformer; large language models; medical education; medical exam.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Overview of the study, including (A) the systematic search process for publications related to generative artificial intelligence and large language models (LLMs) on medical exams, (B) the article screening and inclusion process used to select studies for analysis, and (C) key research questions addressed by the study. RQ: research question.

Figure 2. Overview of the structure and features of the MedExamLLM platform, including core modules such as the large language model (LLM) performance leaderboard, medical exam information search, and medical exam data set management, as well as statistics visualization, user submission of new data, and data set download.

Figure 3. World distribution of the medical exams in the MedExamLLM platform. The map was created using ECharts.

Figure 4. Capabilities of large language models across geographic and linguistic contexts, as indicated by pass rates on medical exams across (A) 15 countries and (B) 8 languages. NA: not applicable.
