[Preprint]. 2023 Apr 24:2023.04.22.23288967.
doi: 10.1101/2023.04.22.23288967.

Evaluating Large Language Models on Medical Evidence Summarization


Liyan Tang et al. medRxiv. 2023.


Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs can be susceptible to generating factually inconsistent summaries and to making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.


Figures

Fig. 1:
Performance of different summarization systems in automatic and human evaluations. (A) Reference-based Metrics (higher scores indicate better summaries). (B) Extractiveness Metrics. (C) Coherence. (D) Factual Consistency. (E) Comprehensiveness. (F) Harmfulness. Statistical analysis by Mann-Whitney U test (C-F), *p-value ≤ 0.05, **p-value ≤ 0.01, ***p-value ≤ 0.001, ****p-value ≤ 0.0001.
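The panel C–F comparisons in Fig. 1 rely on the Mann-Whitney U test, a non-parametric test of whether one group's ratings tend to exceed the other's. As a minimal sketch (illustrative only, not the authors' analysis code; it uses the large-sample normal approximation without tie correction), the U statistic and a two-sided p-value can be computed in plain Python:

```python
import math

def mann_whitney_u(a, b):
    """U statistic: count of pairs (x in a, y in b) with x > y; ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def mann_whitney_p(a, b):
    """Two-sided p-value via the normal approximation to U (valid for larger samples)."""
    n1, n2 = len(a), len(b)
    u = mann_whitney_u(a, b)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))
```

In practice one would hand the test two lists of per-summary ratings (e.g., coherence scores for two systems); a small p-value indicates the rating distributions differ.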
Fig. 2:
Annotator vote distribution for the most and least preferred summaries (A) and the reasons for choosing them (B and C) across all clinical domains and models.

