NPJ Digit Med. 2025 May 2;8(1):239.
doi: 10.1038/s41746-025-01651-w.

Leveraging long context in retrieval augmented language models for medical question answering


Gongbo Zhang et al. NPJ Digit Med. 2025.

Abstract

While holding great promise for improving and facilitating healthcare through applications such as medical literature summarization, large language models (LLMs) struggle to produce up-to-date responses on evolving topics because of outdated knowledge or hallucination. Retrieval-augmented generation (RAG) improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be substantially affected by the rank and density of key information in the retrieval results, as exemplified by the "lost-in-the-middle" problem, in which models overlook key information placed in the middle of a long context. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, that combats the "lost-in-the-middle" issue without modifying model weights. We demonstrate the advantage of the workflow with various LLM backbones on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare by reducing the risk of misinformation, ensuring critical clinical content is retained in generated responses, and enabling more trustworthy use of LLMs in critical tasks such as medical question answering, clinical decision support, and patient-facing applications.


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Workflow of BriefContext.
In the Context Map operation (1), the retrieved documents are divided into multiple partitions to create multiple RAG subtasks. In the Context Reduce operation (2), the responses from the subtasks are collected and summarized into a final response.
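The two-step workflow in the caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_llm` is a hypothetical stand-in for a call to any LLM backbone, and the prompt wording and default partition size are assumptions.

```python
def partition(docs, size):
    """Split the retrieved documents into fixed-size partitions."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def brief_context(question, retrieved_docs, ask_llm, partition_size=5):
    """Map-reduce RAG: answer over each partition, then merge the answers."""
    # Context Map (1): each partition becomes its own small RAG subtask,
    # so no key document sits deep inside one long context window.
    partial_answers = []
    for part in partition(retrieved_docs, partition_size):
        context = "\n\n".join(part)
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        partial_answers.append(ask_llm(prompt))
    # Context Reduce (2): collect the subtask responses and summarize
    # them into a single final response.
    candidates = "\n".join(f"- {a}" for a in partial_answers)
    reduce_prompt = (
        f"Question: {question}\nCandidate answers:\n{candidates}\n"
        "Synthesize one final answer from the candidates."
    )
    return ask_llm(reduce_prompt)
```

Because each subtask sees only a short context, the key document is near the top of some partition's prompt rather than buried mid-context, which is the intuition behind the gains reported below.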
Fig. 2
Fig. 2. Relationship between QA accuracy and positions of key information in the LLM context.
We show the mean and standard deviation of accuracy for a, b GPT-3.5-Turbo and c, d Mixtral-8x7b. The quartiles refer to the position of the key document in the context. Significance levels: *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001; ns not significant.
Fig. 3
Fig. 3. Integration testing of BriefContext with different LLM backbones.
We show the accuracy of various settings with different foundation models: a Llama3-70B-instruct, b Llama2-70B-chat, c Mixtral-8x7b, and d GPT-3.5-turbo-0125. BC BriefContext. RAG retrieval-augmented generation. CoT chain-of-thought. Significance levels: *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001; ns not significant.
Fig. 4
Fig. 4. Analysis of cases with conflicting context information.
Number of cases (red) with conflicting information provided to LLMs and number of correctly resolved cases (green): a Mixtral-8x7b, b GPT-3.5-turbo-0125.
Fig. 5
Fig. 5. Medical QA accuracy of LLMs with various numbers of documents as context information.
We show the mean and standard deviation of accuracy with different numbers of documents in the context window. The top solid line shows the performance in the Oracle setting. The bottom dotted line shows the performance of CoT. With the same key document in the context, accuracy decreases as the number of documents increases. a Llama3-70B-instruct, b Llama2-70B-chat, c Mixtral-8x7b, and d GPT-3.5-turbo-0125. BC BriefContext. RAG retrieval-augmented generation. CoT chain-of-thought. Significance levels: *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001; ns not significant.
Fig. 6
Fig. 6. Relationship between QA accuracy and different context information.
We show the mean and standard deviation of accuracy with the real retrieval results and controlled settings as the context. In the control group, all documents come from results returned by MedCPT. In the experimental group, the context consists of key documents and others selected at random from the knowledge base. a Llama3-70B-instruct, b Llama2-70B-chat, c Mixtral-8x7b, and d GPT-3.5-turbo-0125. BC BriefContext. RAG retrieval-augmented generation. CoT chain-of-thought. Significance levels: *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001; ns not significant.

