[Preprint]. 2025 Dec 29:2025.12.03.25341549.
doi: 10.64898/2025.12.03.25341549.

Drug-drug interaction identification using large language models


Kaitlin Blotske et al. medRxiv.

Abstract

Background: Drug-drug interactions (DDIs) are a significant source of morbidity and adverse drug events (ADEs), particularly in polypharmacy and complex medication regimens. While rules-based software integrated into electronic health records (EHRs) has demonstrated proficiency in flagging DDIs in medication regimens, large language model (LLM)-based identification requires thorough benchmarking and performance evaluation against high-quality datasets before it can be used safely. The purpose of this study was to develop a series of benchmarking experiments for LLM performance in identifying and managing DDIs, using a curated, clinician-annotated dataset of clinically relevant DDIs.

Methods: We evaluated three LLMs (GPT-4o-mini, MedGemma-27B, LLaMA3-70B) on a clinician-annotated benchmark dataset of 750 DDI scenarios spanning three levels of diagnostic complexity. Tasks were aligned with three judgment formats: (1) a pointwise two-drug classification task, (2) a pairwise three-drug discrimination task, and (3) a listwise 4-6 drug selection task. Standardized zero-shot prompting with task-specific instructions was applied to all models. Performance was assessed using precision, recall, F1 score, and accuracy. Reliability was quantified using self-consistency across repeated runs and confidence-aligned metrics to capture the stability of model reasoning.
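The pointwise task reduces to binary classification (interacts / does not interact), so the reported metrics can be computed directly from model predictions. A minimal sketch, assuming labels are encoded as 1 (interaction) and 0 (no interaction); the function name and encoding are illustrative, not taken from the paper:

```python
def pointwise_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for a binary
    interacts-vs-does-not-interact classification task.
    Labels: 1 = interaction present, 0 = no interaction (assumed encoding)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

For example, predictions `[1, 0, 1, 0]` against gold labels `[1, 1, 0, 0]` yield precision, recall, F1, and accuracy of 0.5 each.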

Results: Across the three experiments, model performance varied by task structure and interaction severity. LLaMA3-70B demonstrated the highest recall and F1 score in the pointwise task, whereas GPT-4o-mini achieved superior accuracy and consistency in the pairwise and listwise tasks. MedGemma-27B showed competitive performance in identifying Category D interactions. Self-consistency decreased as task complexity increased, highlighting reduced stability in multi-drug reasoning. No model exhibited uniformly high reliability across all judgment formats.
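Self-consistency across repeated runs can be operationalized as agreement with the modal answer per scenario, averaged over scenarios. A minimal sketch of that idea; the exact definition used in the study may differ:

```python
from collections import Counter

def self_consistency(runs):
    """Mean fraction of repeated queries that agree with the modal
    answer for each scenario. `runs` is a list of per-scenario answer
    lists, one answer string per repeated query (assumed structure)."""
    scores = []
    for answers in runs:
        # Count how often the most common (modal) answer appears.
        modal_count = Counter(answers).most_common(1)[0][1]
        scores.append(modal_count / len(answers))
    return sum(scores) / len(scores)
```

Under this definition, a model that answers identically on every repeat scores 1.0, and the score drops as repeated queries diverge, matching the reported decrease on multi-drug tasks.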

Conclusions: Current LLMs show promising but uneven capabilities in identifying DDIs across clinically relevant task structures. Performance degrades as the reasoning space expands, and stability across repeated queries remains limited. These findings emphasize the need for multi-format evaluation frameworks and reliability-aware assessment when considering LLMs for medication-safety applications.

Keywords: artificial intelligence; healthcare; large language model; medications; pharmacy.


Conflict of interest statement

Conflicts of Interest: The authors have no conflicts of interest.

Figures

Figure 1.
Radar Plot of Large Language Model Performance on Each Task. Across all three models, the pairwise task is consistently easier than the pointwise task. This echoes patterns reported in other LLM evaluation work, where models tend to perform better when they can compare options directly rather than classify a single item in isolation. The results here follow the same trend: giving models two choices seems to reduce uncertainty and improve decision quality. Overall, the figure highlights that each model has a different “strength profile.”

