NPJ Digit Med. 2025 May 2;8(1):239.
doi: 10.1038/s41746-025-01651-w.

Leveraging long context in retrieval augmented language models for medical question answering


Gongbo Zhang et al. NPJ Digit Med. 2025.

Abstract

While holding great promise for improving and facilitating healthcare through applications such as medical literature summarization, large language models (LLMs) struggle to produce up-to-date responses on evolving topics because of outdated knowledge or hallucination. Retrieval-augmented generation (RAG) improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be substantially affected by the rank and density of key information in the retrieval results, as exemplified by the "lost-in-the-middle" problem, in which models overlook key information placed in the middle of a long context. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, that combats the "lost-in-the-middle" issue without modifying model weights. We demonstrate the advantage of the workflow with various LLM backbones on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare by reducing the risk of misinformation, ensuring critical clinical content is retained in generated responses, and enabling more trustworthy use of LLMs in critical tasks such as medical question answering, clinical decision support, and patient-facing applications.


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Workflow of BriefContext.
In the Context Map operation (1), the retrieved documents are divided into multiple partitions to create multiple RAG subtasks. In the Context Reduce operation (2), the responses from the subtasks are collected and summarized into a final response.
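The two-step workflow in the caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_llm` is a hypothetical stand-in for a call to any LLM backbone, and the prompt wording and default partition size are assumptions.

```python
def partition(docs, size):
    """Split the retrieved documents into fixed-size partitions."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def brief_context(question, retrieved_docs, ask_llm, partition_size=5):
    """Map-reduce RAG: answer over each partition, then merge the answers."""
    # Context Map (1): each partition becomes its own small RAG subtask,
    # so no key document sits deep inside one long context window.
    partial_answers = []
    for part in partition(retrieved_docs, partition_size):
        context = "\n\n".join(part)
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        partial_answers.append(ask_llm(prompt))
    # Context Reduce (2): collect the subtask responses and summarize
    # them into a single final response.
    candidates = "\n".join(f"- {a}" for a in partial_answers)
    reduce_prompt = (
        f"Question: {question}\nCandidate answers:\n{candidates}\n"
        "Synthesize one final answer from the candidates."
    )
    return ask_llm(reduce_prompt)
```

Because each subtask sees only a short context, the key document is near the top of some partition's prompt rather than buried mid-context, which is the intuition behind the gains reported below.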
Fig. 2
Fig. 2. Relationship between QA accuracy and positions of key information in the LLM context.
We show the mean and standard deviation of accuracy for a, b GPT-3.5-Turbo and c, d Mixtral-8x7b. The quartiles refer to the position of the key document in the context. Significance levels: *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001; ns not significant.
Fig. 3
Fig. 3. Integration testing of BriefContext with different LLM backbones.
We show the accuracy of various settings with different foundation models: a Llama3-70B-instruct, b Llama2-70B-chat, c Mixtral-8x7b, and d GPT-3.5-turbo-0125. BC BriefContext. RAG retrieval-augmented generation. CoT chain-of-thought. Significance levels: *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001; ns not significant.
Fig. 4
Fig. 4. Analysis of cases with conflicting context information.
Number of cases (red) with conflicting information provided to LLMs and number of correctly resolved cases (green): a Mixtral-8x7b, b GPT-3.5-turbo-0125.
Fig. 5
Fig. 5. Medical QA accuracy of LLMs with various numbers of documents as context information.
We show the mean and standard deviation of accuracy with different numbers of documents in the context window. The top solid line shows the performance in the Oracle setting. The bottom dotted line shows the performance of CoT. With the same key document in the context, accuracy decreases as the number of documents increases. a Llama3-70B-instruct, b Llama2-70B-chat, c Mixtral-8x7b, and d GPT-3.5-turbo-0125. BC BriefContext. RAG retrieval-augmented generation. CoT chain-of-thought. Significance levels: *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001; ns not significant.
Fig. 6
Fig. 6. Relationship between QA accuracy and different context information.
We show the mean and standard deviation of accuracy with the real retrieval results and controlled settings as the context. In the control group, all documents come from results returned by MedCPT. In the experimental group, the context consists of key documents and others selected at random from the knowledge base. a Llama3-70B-instruct, b Llama2-70B-chat, c Mixtral-8x7b, and d GPT-3.5-turbo-0125. BC BriefContext. RAG retrieval-augmented generation. CoT chain-of-thought. Significance levels: *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001; ns not significant.

