[Preprint]. 2023 May 2:rs.3.rs-2883198.
doi: 10.21203/rs.3.rs-2883198/v1.

Almanac: Retrieval-Augmented Language Models for Clinical Medicine


Cyril Zakka et al. Res Sq.

Update in

  • Almanac - Retrieval-Augmented Language Models for Clinical Medicine.
    Zakka C, Shad R, Chaurasia A, Dalal AR, Kim JL, Moor M, Fong R, Phillips C, Alexander K, Ashley E, Boyd J, Boyd K, Hirsch K, Langlotz C, Lee R, Melia J, Nelson J, Sallam K, Tullis S, Vogelsong MA, Cunningham JP, Hiesinger W. NEJM AI. 2024 Feb;1(2). doi: 10.1056/aioa2300068. Epub 2024 Jan 25. PMID: 38343631. Free PMC article.

Abstract

Large language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guidelines and treatment recommendations. Performance on a novel dataset of clinical scenarios (n = 130) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18%, p < 0.05) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential of large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.


Conflict of interest statement

Competing interests The authors declare no competing interests.

Figures

Fig. B1
Fig. B1. Adversarial Performance Overview
With adversarial prompts, Almanac proves more robust than ChatGPT, owing to retriever scoring that matches a query to a given passage. The effectiveness of this approach is inversely correlated with the word count of the adversarial prompt.
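The query-to-passage matching described in this caption is typically done by comparing embedding vectors; a minimal sketch using cosine similarity is below. The function names and the abstention threshold are illustrative assumptions, not Almanac's actual implementation:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Score how closely a query vector matches a passage vector (-1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_passage(query_vec, passages, threshold=0.3):
    """Return (score, text) for the passage best matching the query.

    `passages` is a list of (text, embedding) pairs. If no passage clears
    the threshold, return None: abstaining instead of forwarding unrelated
    (e.g. adversarial) text is what makes a retriever-gated system harder
    to derail than a model prompted directly.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    top = max(scored)
    return top if top[0] >= threshold else None
```

A long adversarial prompt dilutes the embedding of the genuine clinical question, lowering the top similarity score, which is one plausible reading of the inverse correlation with word count noted above.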
Fig. 1
Fig. 1. Almanac Overview
When presented with a query, Almanac first uses external tools to retrieve relevant information before synthesizing a response with citations referencing source material. With this framework, LLM outputs remain grounded in truth, while providing a reliable way of fact-checking their outputs.
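The retrieve-then-synthesize loop this caption describes can be sketched as follows. `retriever` and `llm` are hypothetical callables standing in for Almanac's actual components, and the prompt wording is an assumption:

```python
def answer_with_citations(query, retriever, llm):
    """Ground an LLM answer in retrieved sources so it can be fact-checked."""
    # 1. Retrieve passages relevant to the query from curated reference material.
    passages = retriever(query)  # assumed to return [(source_id, text), ...]
    # 2. Build a prompt that restricts the model to the retrieved context.
    context = "\n".join(f"[{sid}] {text}" for sid, text in passages)
    prompt = (
        "Answer using only the sources below, citing source IDs in brackets.\n"
        f"{context}\nQuestion: {query}\nAnswer:"
    )
    # 3. Synthesize a response; the citations let a clinician trace each claim.
    return llm(prompt)
```

The design choice the caption highlights is that grounding and citation come from the pipeline, not the model: even a fluent model is only as trustworthy as the sources placed in its context.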
Fig. 2
Fig. 2. ClinicalQA Performance
Comparison of performance between Almanac and ChatGPT on the ClinicalQA dataset, as evaluated by physicians. Almanac outperforms its counterpart with significant gains in factuality and marginal improvements in completeness. Although Almanac is more robust to adversarial prompts, both models exhibit hallucinations with omissions. Despite these results, ChatGPT answers are preferred 57% of the time. Error bars visualize standard error (SE).
Fig. 3
Fig. 3. Output Comparison
Comparison between Almanac (top) and ChatGPT (bottom) for a given medical query. With access to a calculator and the retrieved rubric for CHA2DS2-VASc, Almanac correctly responds to the clinical vignette, whereas ChatGPT does not. Sources are removed for illustrative purposes.
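For context, the CHA2DS2-VASc score referenced in this caption is a standard stroke-risk rubric for atrial fibrillation; the tool-use step amounts to simple arithmetic over patient features. A sketch of the standard rubric (the function signature is illustrative, not Almanac's calculator interface):

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_or_tia, vascular_disease):
    """Compute the CHA2DS2-VASc stroke-risk score (0-9)."""
    score = 0
    score += 1 if chf else 0                              # C: congestive heart failure
    score += 1 if hypertension else 0                     # H: hypertension
    score += 2 if age >= 75 else (1 if 65 <= age < 75 else 0)  # A2 / A: age bands
    score += 1 if diabetes else 0                         # D: diabetes mellitus
    score += 2 if stroke_or_tia else 0                    # S2: prior stroke or TIA
    score += 1 if vascular_disease else 0                 # V: vascular disease
    score += 1 if female else 0                           # Sc: sex category (female)
    return score
```

Delegating this computation to a calculator, rather than asking the language model to do arithmetic in-context, is the kind of tool use the figure illustrates.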

