Br J Clin Pharmacol. 2025 Jun 10.
doi: 10.1002/bcp.70137. Online ahead of print.

Evaluating and leveraging large language models in clinical pharmacology and therapeutics assessment: From exam takers to exam shapers

Alexandre O Gérard et al. Br J Clin Pharmacol. 2025.

Abstract

Aims: In medical education, the ability of large language models (LLMs) to match human performance raises questions about their potential as educational tools. This study evaluates the performance of LLMs on Clinical Pharmacology and Therapeutics (CPT) exams, compares their results with those of medical students, and explores their ability to identify poorly formulated multiple-choice questions (MCQs).

Methods: ChatGPT-4 Omni, Gemini Advanced, Le Chat and DeepSeek R1 were tested on local CPT exams (third year of bachelor's degree [L3], first and second years of master's degree [M1 and M2]) and the European Prescribing Exam (EuroPE+). The exams included MCQs and open-ended questions assessing knowledge and prescribing skills. LLM results were analysed using the same scoring system as for students. A confusion matrix was used to evaluate the ability of ChatGPT and Gemini to identify ambiguous or erroneous MCQs.

Results: LLMs achieved results comparable or superior to those of medical students across all levels. For local exams, LLMs outperformed M1 students and matched L3 and M2 students. In EuroPE+, LLMs significantly outperformed students in both the knowledge and prescribing skills sections. All LLM errors in EuroPE+ were genuine (100%), whereas a notable share of local exam errors (24.3%) stemmed from ambiguities or correction flaws. When both ChatGPT and Gemini provided the same incorrect answer to an MCQ, the specificity for detecting ambiguous questions was 92.9%, with a negative predictive value of 85.5%.
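The specificity and negative predictive value reported above follow from a standard 2x2 confusion matrix. The sketch below is illustrative only: the flagging rule mirrors the abstract's description (both models give the same incorrect answer), but the function and class names are hypothetical, and the counts are invented for demonstration and are not the study's data.

```python
from dataclasses import dataclass


def is_flagged(answer_a: str, answer_b: str, answer_key: str) -> bool:
    """Flag an MCQ when two models agree with each other but not with the key.

    Hypothetical helper: the study's actual flagging procedure may differ.
    """
    return answer_a == answer_b and answer_a != answer_key


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # flagged, and the question really is ambiguous/erroneous
    fp: int  # flagged, but the question is sound
    fn: int  # not flagged, but the question is ambiguous/erroneous
    tn: int  # not flagged, and the question is sound

    def specificity(self) -> float:
        # Share of sound questions that were correctly left unflagged.
        return self.tn / (self.tn + self.fp)

    def negative_predictive_value(self) -> float:
        # Share of unflagged questions that really are sound.
        return self.tn / (self.tn + self.fn)


# Invented counts, NOT taken from the study.
example = ConfusionMatrix(tp=5, fp=5, fn=10, tn=80)

if __name__ == "__main__":
    print(f"specificity: {example.specificity():.3f}")
    print(f"NPV:         {example.negative_predictive_value():.3f}")
```

A high specificity means that sound questions are rarely flagged, so a flag is seldom a false alarm among sound items; the negative predictive value indicates how far an unflagged question can be trusted to be sound.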

Conclusion: LLMs demonstrate capabilities comparable to or exceeding those of medical students in CPT exams. Their ability to flag potentially flawed MCQs highlights their value not only as educational tools but also as quality control instruments in exam preparation.

Keywords: large language model; medical study; pedagogy; pharmacology; therapeutics.
