[Preprint]. 2023 May 2:rs.3.rs-2883198.
doi: 10.21203/rs.3.rs-2883198/v1.

Almanac: Retrieval-Augmented Language Models for Clinical Medicine


Cyril Zakka et al. Res Sq.

Update in

  • Almanac - Retrieval-Augmented Language Models for Clinical Medicine.
    Zakka C, Shad R, Chaurasia A, Dalal AR, Kim JL, Moor M, Fong R, Phillips C, Alexander K, Ashley E, Boyd J, Boyd K, Hirsch K, Langlotz C, Lee R, Melia J, Nelson J, Sallam K, Tullis S, Vogelsong MA, Cunningham JP, Hiesinger W. NEJM AI. 2024 Feb;1(2). doi: 10.1056/aioa2300068. Epub 2024 Jan 25. PMID: 38343631. Free PMC article.

Abstract

Large language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guidelines and treatment recommendations. Performance on a novel dataset of clinical scenarios (n = 130) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18%, p < 0.05) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential of large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.


Conflict of interest statement

Competing interests The authors declare no competing interests.

Figures

Fig. B1
Fig. B1. Adversarial Performance Overview
With adversarial prompts, Almanac proves more robust than ChatGPT, owing to retriever scoring that matches a query to a given passage. The effectiveness of this approach is inversely correlated with the word count of the adversarial prompt.
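The query-to-passage matching described in this caption is typically done by comparing embedding vectors; a minimal sketch using cosine similarity is below. The function names and the abstention threshold are illustrative assumptions, not Almanac's actual implementation:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Score how closely a query vector matches a passage vector (-1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_passage(query_vec, passages, threshold=0.3):
    """Return (score, text) for the passage best matching the query.

    `passages` is a list of (text, embedding) pairs. If no passage clears
    the threshold, return None: abstaining instead of forwarding unrelated
    (e.g. adversarial) text is what makes a retriever-gated system harder
    to derail than a model prompted directly.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    top = max(scored)
    return top if top[0] >= threshold else None
```

A long adversarial prompt dilutes the embedding of the genuine clinical question, lowering the top similarity score, which is one plausible reading of the inverse correlation with word count noted above.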
Fig. 1
Fig. 1. Almanac Overview
When presented with a query, Almanac first uses external tools to retrieve relevant information before synthesizing a response with citations referencing source material. With this framework, LLM outputs remain grounded in truth, while providing a reliable way of fact-checking their outputs.
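The retrieve-then-synthesize loop this caption describes can be sketched as follows. `retriever` and `llm` are hypothetical callables standing in for Almanac's actual components, and the prompt wording is an assumption:

```python
def answer_with_citations(query, retriever, llm):
    """Ground an LLM answer in retrieved sources so it can be fact-checked."""
    # 1. Retrieve passages relevant to the query from curated reference material.
    passages = retriever(query)  # assumed to return [(source_id, text), ...]
    # 2. Build a prompt that restricts the model to the retrieved context.
    context = "\n".join(f"[{sid}] {text}" for sid, text in passages)
    prompt = (
        "Answer using only the sources below, citing source IDs in brackets.\n"
        f"{context}\nQuestion: {query}\nAnswer:"
    )
    # 3. Synthesize a response; the citations let a clinician trace each claim.
    return llm(prompt)
```

The design choice the caption highlights is that grounding and citation come from the pipeline, not the model: even a fluent model is only as trustworthy as the sources placed in its context.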
Fig. 2
Fig. 2. ClinicalQA Performance
Comparison of performance between Almanac and ChatGPT on the ClinicalQA dataset, as evaluated by physicians. Almanac outperforms its counterpart with significant gains in factuality and marginal improvements in completeness. Although Almanac is more robust to adversarial prompts, both models exhibit hallucinations with omissions. Despite these results, ChatGPT answers are preferred 57% of the time. Error bars visualize standard error (SE).
Fig. 3
Fig. 3. Output Comparison
Comparison between Almanac (top) and ChatGPT (bottom) for a given medical query. With access to a calculator and the retrieved rubric for CHA2DS2-VASc, Almanac correctly responds to the clinical vignette, whereas ChatGPT does not. Sources are removed for illustrative purposes.
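For context, the CHA2DS2-VASc score referenced in this caption is a standard stroke-risk rubric for atrial fibrillation; the tool-use step amounts to simple arithmetic over patient features. A sketch of the standard rubric (the function signature is illustrative, not Almanac's calculator interface):

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_or_tia, vascular_disease):
    """Compute the CHA2DS2-VASc stroke-risk score (0-9)."""
    score = 0
    score += 1 if chf else 0                              # C: congestive heart failure
    score += 1 if hypertension else 0                     # H: hypertension
    score += 2 if age >= 75 else (1 if 65 <= age < 75 else 0)  # A2 / A: age bands
    score += 1 if diabetes else 0                         # D: diabetes mellitus
    score += 2 if stroke_or_tia else 0                    # S2: prior stroke or TIA
    score += 1 if vascular_disease else 0                 # V: vascular disease
    score += 1 if female else 0                           # Sc: sex category (female)
    return score
```

Delegating this computation to a calculator, rather than asking the language model to do arithmetic in-context, is the kind of tool use the figure illustrates.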

