[Preprint]. bioRxiv. 2025 Aug 2:2025.08.01.668022. doi: 10.1101/2025.08.01.668022.

Federated Knowledge Retrieval Elevates Large Language Model Performance on Biomedical Benchmarks


Janet Joy et al. bioRxiv.

Abstract

Background: Large language models (LLMs) have significantly advanced natural language processing in biomedical research; however, their reliance on implicit, statistical representations often results in factual inaccuracies or hallucinations, a serious concern in high-stakes biomedical contexts.

Results: To overcome these limitations, we developed BTE-RAG, a retrieval-augmented generation framework that integrates the reasoning capabilities of advanced language models with explicit mechanistic evidence sourced from BioThings Explorer, an API federation of more than sixty authoritative biomedical knowledge sources. We systematically evaluated BTE-RAG against traditional LLM-only methods on three benchmark datasets that we created from DrugMechDB, targeting gene-centric mechanisms (798 questions), metabolite effects (201 questions), and drug–biological process relationships (842 questions). On the gene-centric task, BTE-RAG increased accuracy from 51% to 75.8% for GPT-4o mini and from 69.8% to 78.6% for GPT-4o. On metabolite-focused questions, the proportion of responses with cosine-similarity scores of at least 0.90 rose by 82% for GPT-4o mini and 77% for GPT-4o. While overall accuracy was comparable on the drug–biological process benchmark, retrieval improved response concordance, producing a greater than 10% increase in high-agreement answers (from 129 to 144) with GPT-4o.
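
As a rough illustration of the generative step described above, the sketch below appends retrieved subject–predicate–object triples to a question before calling an OpenAI chat model; the LLM-only baseline corresponds to calling the same function with an empty evidence list. The prompt wording and the helper name answer() are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the generative phase of a BTE-RAG-style pipeline (assumed design).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, triples: list[str], model: str = "gpt-4o-mini") -> str:
    """Append retrieved subject-predicate-object triples to the question and query the LLM."""
    context = "\n".join(f"- {t}" for t in triples)
    prompt = (
        "Use the following mechanistic evidence to answer the question.\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# LLM-only baseline: answer(question, triples=[]) omits all external evidence.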

Conclusion: Federated knowledge retrieval provides transparent improvements in accuracy for large language models, establishing BTE-RAG as a practical tool for mechanistic exploration and translational biomedical research.


Conflict of interest statement

Competing Interests: The authors declare no competing interests.

Figures

Figure 1: Retrieval-Augmented Generation workflow and derivation of mechanistic evaluation benchmarks.
(A) Schematic of the BTE-RAG pipeline, which augments large language model (LLM) responses with context retrieved from the BioThings Explorer (BTE) knowledge graph. In the LLM-only pathway, the model generates a response using only the input question. In contrast, BTE-RAG operates in two phases: a Retrieval Phase, where relevant entities are extracted from the question and queried against BTE to collect mechanistically relevant subject–predicate–object triples, and a Generative Phase, where this curated context is appended to the input question and passed to the LLM. The resulting outputs, LLM-only or BTE-RAG, can be directly compared to assess the impact of knowledge-augmented generation. (B) Construction of benchmark datasets from DrugMechDB, a curated biomedical knowledge graph of drug–disease mechanisms. Directed paths connecting a drug to a disease were mined and transformed into structured questions targeting different mechanistic facets: (i) gene nodes (Mechanistic Gene Benchmark), (ii) biochemical entities or metabolites (Metabolite Benchmark), and (iii) drug–biological process–disease paths (Drug Benchmark). Each benchmark provides paired questions and gold-standard labels for rigorous, domain-specific evaluation of retrieval-augmented generation.
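
To make the Retrieval Phase in panel A concrete, the sketch below issues a one-hop TRAPI query for drug–gene edges against the public BioThings Explorer endpoint and flattens the returned knowledge graph into subject–predicate–object strings. The endpoint URL, the Biolink categories, and the response handling follow the general TRAPI convention and are assumptions for illustration; the paper's actual query construction and filtering may differ.

# Illustrative retrieval-phase sketch: one-hop query against BioThings Explorer (TRAPI).
import requests

BTE_URL = "https://bte.transltr.io/v1/query"  # assumed public TRAPI endpoint

def one_hop_triples(drug_curie: str) -> list[str]:
    """Return 'subject -- predicate -- object' strings for genes linked to a drug CURIE."""
    query = {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [drug_curie], "categories": ["biolink:SmallMolecule"]},
                    "n1": {"categories": ["biolink:Gene"]},
                },
                "edges": {"e01": {"subject": "n0", "object": "n1"}},
            }
        }
    }
    kg = requests.post(BTE_URL, json=query, timeout=300).json()["message"]["knowledge_graph"]
    triples = []
    for edge in kg["edges"].values():
        subj = kg["nodes"][edge["subject"]].get("name", edge["subject"])
        obj = kg["nodes"][edge["object"]].get("name", edge["object"])
        triples.append(f"{subj} -- {edge['predicate']} -- {obj}")
    return triples

# Example call with a hypothetical drug CURIE: one_hop_triples("CHEBI:15365")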
Figure 2: Retrieval-augmented generation with BTE-RAG markedly improves factual accuracy on the gene-centric benchmark using GPT-4o models.
(A) For the compact gpt-4o-mini model, introducing the BTE-RAG retrieval layer raised overall accuracy from 51% (hatched bar, LLM-only baseline) to 75.8% (solid bar). (B) The same intervention applied to the larger gpt-4o model increased accuracy from 69.8% to 78.6%. Accuracy was calculated as the proportion of correct answers across the composite biomedical question-answering benchmark described in Methods.
Figure 3: Retrieval-augmented context increases semantic concordance with ground-truth metabolites.
(A) Cosine-similarity scores between each generated answer and the corresponding reference metabolite (sentence-transformer embeddings; see Methods) are plotted for all 201 questions in the Metabolite Benchmark, ordered from lowest to highest similarity. Dashed traces represent the LLM-only baseline, whereas solid traces include BioThings Explorer (BTE) retrieval-augmented context. Orange curves denote gpt-4o-mini; blue curves denote gpt-4o. For both model sizes, BTE-RAG systematically shifts the similarity distribution upward, indicating improved semantic alignment with the curated biochemical ground truth. (B) Score distribution: GPT-4o, LLM-only. Histogram of cosine-similarity scores for GPT-4o answers generated without external context. Bar heights and numeric labels denote the number of questions (n = 201) falling in each bin; the overlaid KDE line summarizes the distribution. (C) Score distribution: GPT-4o + BTE-RAG. Same format as panel B but for GPT-4o answers generated with BTE-RAG’s context. The right-shifted, more peaked distribution highlights the improvement in semantic alignment achieved by retrieval-augmented generation.
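
As an illustration of how the per-question cosine-similarity scores in this figure can be computed, the sketch below embeds a generated answer and the reference metabolite with a sentence-transformer model and takes the cosine similarity of the two vectors. The specific embedding model name is an assumption; the Methods section specifies the model actually used.

# Sketch of the semantic-concordance metric using sentence-transformer embeddings.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine_score(generated_answer: str, reference: str) -> float:
    """Cosine similarity between the embeddings of an answer and its gold-standard label."""
    vectors = embedder.encode([generated_answer, reference], convert_to_tensor=True)
    return util.cos_sim(vectors[0], vectors[1]).item()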
Figure 4: Retrieval-augmented generation maintains overall parity yet excels in the high-fidelity regime of drug-centric mechanistic answers.
(A) Cosine-similarity scores (sentence-transformer embeddings; see Methods) between each generated answer and the reference drug→biological-process pathway are plotted for all 842 questions in the Drug Benchmark, ordered from lowest to highest similarity. Dashed traces (LLM-only) and solid traces (BTE-RAG) follow nearly overlapping trajectories across most of the distribution, indicating broadly comparable performance between the two inference modes. However, above a cosine-similarity threshold of ≈ 0.7, both gpt-4o-mini (orange) and gpt-4o (blue) curves generated with BTE context surge ahead of their prompt-only counterparts, revealing a marked advantage in producing highly concordant mechanistic explanations. (B) Score distribution: GPT-4o, LLM-only. Histogram of cosine-similarity scores for GPT-4o answers generated without external context. The hatched bar at 0.90–1.00 marks the high-fidelity zone, capturing 129 near-perfect matches produced by the baseline model. (C) Score distribution: GPT-4o + BTE-RAG. Same format as panel B but for GPT-4o answers produced with BTE-RAG’s context. The distribution is right-shifted, and the solid bar in the 0.90–1.00 high-fidelity zone now contains 144 answers, highlighting the enrichment of top-tier mechanistic concordance achieved through retrieval-augmented generation.
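
The high-fidelity counts reported in panels B and C (129 versus 144 answers in the 0.90–1.00 bin) reduce to tallying scores at or above a threshold; a minimal sketch follows, where the per-question score lists are hypothetical placeholders produced by a scoring function such as the one above.

# Count answers falling in the 0.90-1.00 high-fidelity bin (threshold taken from the figure).
def high_fidelity_count(scores: list[float], threshold: float = 0.90) -> int:
    return sum(score >= threshold for score in scores)

# e.g. high_fidelity_count(baseline_scores) vs. high_fidelity_count(rag_scores)
# would reproduce the 129-to-144 comparison, given the per-question score lists.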
