Answering real-world clinical questions using large language model, retrieval-augmented generation, and agentic systems

Yen Sia Low et al. Digit Health. 2025 Jun 9;11:20552076251348850. doi: 10.1177/20552076251348850. eCollection 2025 Jan-Dec.

Abstract

Objective: The practice of evidence-based medicine can be challenging when relevant data are lacking or difficult to contextualize for a specific patient. Large language models (LLMs) could address both challenges by summarizing the published literature or by generating new studies from real-world data.

Materials and methods: We submitted 50 clinical questions to five LLM-based systems: OpenEvidence, which uses an LLM for retrieval-augmented generation (RAG); ChatRWD, which uses an LLM as an interface to a data extraction and analysis pipeline; and three general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini 1.5 Pro). Nine independent physicians evaluated the answers for relevance, quality of supporting evidence, and actionability (i.e., sufficient to justify or change clinical practice).
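For readers unfamiliar with the retrieval-augmented generation pattern, the minimal Python sketch below illustrates the core loop: retrieve the documents most relevant to a question, then condition the model's answer on them. It is illustrative only; the term-overlap scoring and the call_llm stub are hypothetical stand-ins, not the implementation of OpenEvidence or any other system studied here.

    # Minimal sketch of retrieval-augmented generation (RAG). All names
    # are hypothetical stand-ins; a production system would use dense
    # vector search and a hosted LLM rather than term overlap and a stub.

    def score(question: str, doc: str) -> int:
        """Naive relevance score: number of shared lowercase terms."""
        return len(set(question.lower().split()) & set(doc.lower().split()))

    def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
        """Return the k highest-scoring documents for the question."""
        return sorted(corpus, key=lambda d: score(question, d), reverse=True)[:k]

    def call_llm(prompt: str) -> str:
        """Stand-in for a real LLM API call; echoes the prompt so the
        sketch runs without external services."""
        return f"[answer grounded in]\n{prompt}"

    def answer_with_rag(question: str, corpus: list[str]) -> str:
        """Ground the generated answer in the retrieved evidence."""
        evidence = "\n".join(retrieve(question, corpus))
        return call_llm(f"Question: {question}\nEvidence:\n{evidence}")

    corpus = [
        "RCT: drug A reduced 30-day readmission versus placebo.",
        "Cohort study: drug B showed no effect on 30-day readmission.",
    ]
    print(answer_with_rag("Does drug A reduce 30-day readmission?", corpus))

The agentic ChatRWD pathway differs in kind: per the methods above, its LLM serves as an interface that orchestrates a data extraction and analysis pipeline over real-world data, rather than summarizing retrieved text.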

Results: General-purpose LLMs rarely produced relevant, evidence-based answers (2-10% of questions). In contrast, the RAG-based and agentic systems produced relevant, evidence-based answers for 24% (OpenEvidence) and 58% (ChatRWD) of questions, respectively. For questions with existing published evidence, OpenEvidence produced actionable results for 48%, compared to 37% for ChatRWD and <5% for the general-purpose LLMs. For questions lacking existing literature, ChatRWD provided actionable results for 52%, compared to <10% for the other LLMs.

Discussion: Special-purpose LLM systems greatly outperformed general-purpose LLMs in answering clinical questions. The RAG-based system (OpenEvidence) performed well when existing data were available, while only the agentic ChatRWD could provide actionable answers when preexisting studies were lacking.

Conclusion: Synergistic systems combining RAG-based evidence summarization and agentic generation of novel evidence could improve the availability of pertinent evidence for patient care.
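As a sketch of such a synergistic system, the hypothetical router below answers from published evidence when retrieval finds any and otherwise falls back to an agentic pathway that generates new real-world evidence. Every name in it (Evidence, retrieve_published_evidence, run_rwd_study) is an assumption for illustration, not an API of OpenEvidence or ChatRWD.

    from dataclasses import dataclass

    @dataclass
    class Evidence:
        source: str    # "published literature" or "new RWD study"
        summary: str

    def retrieve_published_evidence(question: str) -> list[Evidence]:
        """Hypothetical RAG step; a real system would search the
        literature. Returns [] here to mimic a novel question."""
        return []

    def run_rwd_study(question: str) -> Evidence:
        """Hypothetical agentic step: design and execute an
        observational study on real-world data."""
        return Evidence("new RWD study", f"Cohort analysis for: {question}")

    def answer_clinical_question(question: str) -> Evidence:
        """Summarize existing evidence when it exists; otherwise
        generate new evidence agentically."""
        published = retrieve_published_evidence(question)
        if published:
            return published[0]
        return run_rwd_study(question)

    print(answer_clinical_question("Does drug A reduce 30-day readmission?"))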

Keywords: Artificial intelligence; cohort study; evidence-based medicine; large language models; retrieval-augmented generation.

Conflict of interest statement

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: ChatRWD, one of the LLM systems evaluated in this study, is developed by Atropos Health, where many of the authors are employed. NHS is not an Atropos Health employee but sits on its board. OpenEvidence, another LLM system evaluated here, is provided by OpenEvidence, which we consulted during the writing of this manuscript. Non-Atropos employees NA, HH, RVN, MP, CJP, SV, APY, D-HY, and ARZ have nothing to disclose.

Figures

Figure 1. Evaluation process.

Figure 2. Performance of large language model (LLM) systems stratified by question novelty.
