Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis

Hui Zong et al. J Med Internet Res. 2024 Dec 27;26:e66114. doi: 10.2196/66114.

Abstract

Background: Large language models (LLMs) are increasingly integrated into medical education, with transformative potential for learning and assessment. However, their performance across diverse medical exams globally has remained underexplored.

Objective: This study aims to introduce MedExamLLM, a comprehensive platform designed to systematically evaluate the performance of LLMs on medical exams worldwide. Specifically, the platform seeks to (1) compile and curate performance data for diverse LLMs on worldwide medical exams; (2) analyze trends and disparities in LLM capabilities across geographic regions, languages, and contexts; and (3) provide a resource for researchers, educators, and developers to explore and advance the integration of artificial intelligence in medical education.

Methods: A systematic search was conducted on April 25, 2024, in the PubMed database to identify relevant publications. Inclusion criteria encompassed peer-reviewed, English-language, original research articles that evaluated at least one LLM on a medical exam. Exclusion criteria included review articles, non-English publications, preprints, and studies without relevant data on LLM performance. Candidate publications were screened independently by 2 researchers to ensure accuracy and reliability. Data, including exam information, data processing details, model performance, data availability, and references, were manually curated, standardized, and organized. The curated data were integrated into the MedExamLLM platform, which visualizes and analyzes LLM performance across geographic, linguistic, and exam characteristics. The web platform was developed with a focus on accessibility, interactivity, and scalability to support continuous data updates and user engagement.
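
To make the retrieval step concrete, the sketch below runs a PubMed query through the NCBI E-utilities esearch endpoint, capped at the April 25, 2024 search date reported above. The query string is a hypothetical reconstruction for illustration only; the study's exact search expression is not reproduced here.

    # Minimal sketch of the PubMed retrieval step via NCBI E-utilities (esearch).
    import requests

    ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    params = {
        "db": "pubmed",
        # Hypothetical query; the study's published search expression may differ.
        "term": '("large language model" OR ChatGPT OR GPT) AND "medical exam"',
        "datetype": "pdat",
        "mindate": "2000/01/01",  # E-utilities requires mindate and maxdate together
        "maxdate": "2024/04/25",  # search date reported in the Methods
        "retmax": 500,
        "retmode": "json",
    }

    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    result = resp.json()["esearchresult"]
    print(f'{result["count"]} hits; {len(result["idlist"])} PMIDs retrieved for screening')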

Results: A total of 193 articles were included in the final analysis. MedExamLLM comprised information on 16 LLMs across 198 medical exams conducted in 28 countries and 15 languages from 2009 to 2023. The United States accounted for the highest number of medical exams and related publications, and English was the dominant exam language. The Generative Pretrained Transformer (GPT) series models, especially GPT-4, demonstrated superior performance, achieving pass rates significantly higher than those of other LLMs. The analysis revealed significant variability in LLM capabilities across geographic and linguistic contexts.
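
As a minimal sketch of how such pass-rate comparisons can be aggregated, the snippet below groups evaluation records by model, country, and language. The flat record schema (model, country, language, passed) is a hypothetical simplification for illustration, not the actual MedExamLLM data model.

    # Sketch of leaderboard-style pass-rate aggregation over curated records.
    import pandas as pd

    # Hypothetical rows; real entries would carry exam metadata and references.
    records = pd.DataFrame([
        {"model": "GPT-4",   "country": "United States", "language": "English",  "passed": True},
        {"model": "GPT-3.5", "country": "United States", "language": "English",  "passed": False},
        {"model": "GPT-4",   "country": "Japan",         "language": "Japanese", "passed": True},
        {"model": "Llama 2", "country": "Japan",         "language": "Japanese", "passed": False},
    ])

    # The mean of a boolean column is the pass rate within each group.
    by_model = records.groupby("model")["passed"].mean()
    by_country = records.groupby("country")["passed"].mean()
    by_language = records.groupby("language")["passed"].mean()
    print(by_model, by_country, by_language, sep="\n\n")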

Conclusions: MedExamLLM is an open-source, freely accessible online platform that provides comprehensive performance evaluations and synthesized evidence on LLMs across medical exams worldwide. It serves as a valuable resource for educators, researchers, and developers in clinical medicine and artificial intelligence, offering insights to support the integration of artificial intelligence into medical education. Limitations include potential biases in the data source and the exclusion of non-English literature. Future research should address these gaps and explore methods to enhance LLM performance in diverse contexts.

Keywords: AI; ChatGPT; LLMs; artificial intelligence; generative pretrained transformer; large language models; medical education; medical exam.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Overview of the study, including (A) the systematic search process for publications related to generative artificial intelligence and large language models (LLMs) on medical exams, (B) the article screening and inclusion process used to select studies for analysis, and (C) key research questions addressed by the study. RQ: research question.

Figure 2. Overview of the structure and features of the MedExamLLM platform, including core modules such as the large language model (LLM) performance leaderboard, medical exam information search, and medical exam data set management, as well as statistics visualization, user submission of new data, and data set download.

Figure 3. World distribution of the medical exams in the MedExamLLM platform. The map was created using ECharts.

Figure 4. Capabilities of large language models across geographic and linguistic contexts, as indicated by pass rates on medical exams across (A) 15 countries and (B) 8 languages. NA: not applicable.
