Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.

Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions

Ali Abbas et al. Cureus. 2024.

Abstract

Introduction: Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Recognizing the significance of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study aims to compare the accuracy of popular LLMs on NBME clinical subject exam sample questions.

Methods: The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. The responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023. The response by each LLM was compared to the answer provided by the NBME and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA).
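
A minimal sketch of the scoring-and-comparison step described above, assuming per-question correctness is recorded as 1 (correct) or 0 (incorrect) for each model and compared with one-way ANOVA; the variable names and the 0/1 vectors below are illustrative placeholders, not the authors' actual data or code.

    # Hypothetical illustration of the analysis described in Methods.
    # The 0/1 vectors are placeholders, not the study's dataset.
    from scipy import stats

    # 1 = model matched the NBME-provided answer, 0 = did not
    gpt4   = [1, 1, 1, 1, 1]
    gpt35  = [1, 0, 1, 1, 0]
    claude = [1, 1, 0, 1, 1]
    bard   = [0, 1, 1, 0, 1]

    # Accuracy per model
    for name, scores in [("GPT-4", gpt4), ("GPT-3.5", gpt35), ("Claude", claude), ("Bard", bard)]:
        print(name, sum(scores) / len(scores))

    # One-way ANOVA across the four models' per-question correctness
    f_stat, p_value = stats.f_oneway(gpt4, gpt35, claude, bard)
    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")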

Results: Each LLM was queried with a total of 163 questions. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The overall performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8%, 15.3%, and 24.5%, respectively. The overall performances of GPT-3.5, Claude, and Bard did not differ significantly from one another. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and outperformed GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20), while the family medicine exam had the lowest average score (3.75/5).

Conclusion: GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs show promise, discernment in their application is crucial, given occasional inaccuracies. As technological advancements continue, regular reassessment and refinement are imperative to maintain their reliability and relevance in medicine.

Keywords: artificial intelligence (ai); artificial intelligence and education; artificial intelligence in medicine; chatgpt; claude; google's bard; gpt-4; large language model; nbme subject exam; united states medical licensing examination (usmle).


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Overall performance of the LLMs on all NBME sample questions (LLM: large language model, NBME: National Board of Medical Examiners)

Figure 2. Performance of the LLMs on each subject exam (LLM: large language model)
