Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model

Affiliations

¹ Translational Data Sciences, Genmab, Princeton, New Jersey, United States of America.
² Data Sciences and AI, Genmab, Princeton, New Jersey, United States of America.
³ Commercial Data Sciences, Genmab, Princeton, New Jersey, United States of America.

PMID: 39167594
PMCID: PMC11338460
DOI: 10.1371/journal.pdig.0000568

David Soong et al. PLOS Digit Health. 2024.

PLOS Digit Health. 2024 Aug 21;3(8):e0000568.

doi: 10.1371/journal.pdig.0000568. eCollection 2024 Aug.

Abstract

Large language models (LLMs) have made a significant impact on the field of general artificial intelligence. General-purpose LLMs exhibit strong logic and reasoning skills and broad world knowledge, but can generate misleading results when prompted on specific subject areas. LLMs trained with domain-specific knowledge can reduce the generation of misleading information (i.e., hallucinations) and enhance precision in specialized contexts. Training new LLMs on specific corpora, however, can be resource intensive. Here we explored the use of a retrieval-augmented generation (RAG) model, which we tested on literature specific to a biomedical research area. OpenAI's GPT-3.5, GPT-4, Microsoft's Prometheus, and a custom RAG model were used to answer 19 questions pertaining to diffuse large B-cell lymphoma (DLBCL) disease biology and treatment. Eight independent reviewers assessed LLM responses on accuracy, relevance, and readability, rating responses on a 3-point scale for each category. These scores were then used to compare LLM performance. The performance of the LLMs varied across scoring categories. On accuracy and relevance, the RAG model outperformed the other models, with higher average scores and the most top scores across questions. GPT-4 was closer to the RAG model on relevance than on accuracy. By the same measures, GPT-4 and GPT-3.5 had the highest scores for readability when compared with the other LLMs. GPT-4 and GPT-3.5 also had more answers with hallucinations than the other LLMs, due to non-existent references and inaccurate responses to clinical questions. Our findings suggest that an oncology research-focused RAG model may outperform general-purpose LLMs in accuracy and relevance when answering subject-related questions. This framework can be tailored to Q&A in other subject areas. Further research will help clarify the impact of LLM architectures, RAG methodologies, and prompting techniques in answering questions across different subject areas.
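
The abstract does not describe how the custom RAG model retrieves or prompts, so the following is only a minimal sketch of the general retrieval-augmented idea, not the authors' pipeline. It assumes a toy bag-of-words cosine similarity in place of a real embedding model, placeholder passages in place of the DLBCL literature, and hypothetical helper names (`tokenize`, `cosine`, `retrieve`, `build_prompt`). The call to an actual LLM is deliberately left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that grounds the answer in retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Placeholder passages (assumption: NOT taken from the study's corpus).
passages = [
    "Passage about lymphoma treatment options.",
    "Passage about cell biology of lymphoma.",
    "Passage about unrelated cardiology topic.",
]
question = "What are treatment options for lymphoma?"
prompt = build_prompt(question, passages, k=1)
```

Restricting the model to retrieved context is one common way such systems try to limit hallucinated references; whether the study's model used this exact instruction is not stated in the abstract.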

Copyright: © 2024 Soong et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1**
Average (A) accuracy, (B) relevance, and (C) readability scores across 19 questions/queries answered, per LLM. Each bar represents the average score across 19 questions for the three metrics (error bars represent standard error of the mean).
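
As a rough illustration of the summary in Fig 1, the sketch below computes a mean score and its standard error of the mean (SEM) for one LLM on one metric. It assumes, without confirmation from the caption, that each question's score is first averaged over its reviewers and that the SEM uses the sample standard deviation across questions. The ratings are made up and the function names are hypothetical.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # SEM = sample standard deviation / sqrt(n); undefined for n < 2, so use 0.0.
    sem = statistics.stdev(values) / math.sqrt(n) if n > 1 else 0.0
    return mean, sem


def summarize(ratings_per_question):
    """Average reviewer ratings per question, then summarize across questions."""
    per_question = [statistics.fmean(r) for r in ratings_per_question]
    return mean_and_sem(per_question)


# Made-up ratings on the 3-point scale: 4 questions x 3 reviewers.
example = [[3, 3, 2], [2, 2, 2], [3, 2, 2], [1, 2, 3]]
mean, sem = summarize(example)
print(f"mean={mean:.2f}, SEM={sem:.2f}")
```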

**Fig 2**
Histogram of (A) accuracy, (B) relevance, and (C) readability scores for each LLM. Bars represent the count of 3-point, 2-point, and 1-point scores from each of the three reviewers for each of the 19 questions (57 scores in total per LLM).
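
The counts behind a panel of Fig 2 could be tallied as in the sketch below, which pools every reviewer score for one LLM and one metric. The data are placeholders chosen only to reproduce the 19 × 3 = 57 total stated in the caption, and `score_histogram` is a hypothetical name.

```python
from collections import Counter


def score_histogram(ratings_per_question):
    """Count how many 3-, 2-, and 1-point scores appear across all ratings."""
    counts = Counter(
        score for ratings in ratings_per_question for score in ratings
    )
    # Report every level of the 3-point scale, including levels with no scores.
    return {score: counts.get(score, 0) for score in (3, 2, 1)}


# Placeholder data: 19 questions, each rated by three reviewers.
example = [[3, 3, 2]] * 19
hist = score_histogram(example)
print(hist, "total:", sum(hist.values()))
```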
