This is a preprint.
Drug-drug interaction identification using large language models
- PMID: 41503479
- PMCID: PMC12772686
- DOI: 10.64898/2025.12.03.25341549
Abstract
Background: Drug-drug interactions (DDIs) are a significant source of morbidity and adverse drug events (ADEs), particularly in the setting of polypharmacy and complex medication regimens. While rules-based software integrated into electronic health records (EHRs) has demonstrated proficiency in identifying DDIs within medication regimens, large language model (LLM) based identification requires thorough benchmarking and performance evaluation on high-quality datasets before it can be used safely. The purpose of this study was to develop a series of benchmarking experiments for LLM performance in the identification and management of DDIs, using a curated clinician-annotated dataset of clinically relevant DDIs.
Methods: We evaluated three LLMs (GPT-4o-mini, MedGemma-27B, LLaMA3-70B) using a clinician-annotated benchmark dataset of 750 DDI scenarios spanning three levels of diagnostic complexity. Tasks were aligned with three judgment formats: (1) a pointwise two-drug classification task, (2) a pairwise three-drug discrimination task, and (3) a listwise four-to-six-drug selection task. Standardized zero-shot prompting with task-specific instructions was used for all models. Performance was assessed using precision, recall, F1 score, and accuracy. Reliability was quantified using self-consistency across repeated runs and confidence-aligned metrics to capture stability in model reasoning.
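The evaluation metrics named above (precision, recall, F1) and the self-consistency measure can be sketched as follows. This is a minimal illustration, not the authors' code: the label values, the majority-vote definition of self-consistency, and the function names are assumptions for illustration only.

```python
from collections import Counter

def precision_recall_f1(y_true, y_pred, positive="interaction"):
    """Pointwise binary metrics for a two-drug classification task.
    Label values are hypothetical; the paper's exact encoding is not given."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def self_consistency(runs_per_scenario):
    """One plausible reading of self-consistency: for each scenario,
    the fraction of repeated runs agreeing with the majority answer,
    averaged over scenarios."""
    scores = []
    for answers in runs_per_scenario:
        majority_count = Counter(answers).most_common(1)[0][1]
        scores.append(majority_count / len(answers))
    return sum(scores) / len(scores)
```

A model that gives the same answer on every repeated run of every scenario would score 1.0 on this self-consistency measure; disagreement across runs pulls the score toward the majority fraction.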
Results: Across the three experiments, model performance varied by task structure and interaction severity. LLaMA3-70B demonstrated the highest recall and F1 score in the pointwise task, whereas GPT-4o-mini achieved superior accuracy and consistency in the pairwise and listwise tasks. MedGemma-27B showed competitive performance in identifying Category D interactions. Self-consistency decreased as task complexity increased, highlighting reduced stability in multi-drug reasoning. No model exhibited uniformly high reliability across all judgment formats.
Conclusions: Current LLMs show promising but uneven capabilities in identifying DDIs across clinically relevant task structures. Performance degrades as the reasoning space expands, and stability across repeated queries remains limited. These findings emphasize the need for multi-format evaluation frameworks and reliability-aware assessment when considering LLMs for medication-safety applications.
Keywords: artificial intelligence; healthcare; large language model; medications; pharmacy.
Conflict of interest statement
Conflicts of Interest: The authors have no conflicts of interest.