This is a preprint.
Evaluating Large Language Models on Medical Evidence Summarization
- PMID: 37162998
- PMCID: PMC10168498
- DOI: 10.1101/2023.04.22.23288967
Evaluating Large Language Models on Medical Evidence Summarization
Update in
-
Evaluating large language models on medical evidence summarization.NPJ Digit Med. 2023 Aug 24;6(1):158. doi: 10.1038/s41746-023-00896-7. NPJ Digit Med. 2023. PMID: 37620423 Free PMC article.
Abstract
Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study has demonstrated that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.
Figures
References
-
- Chowdhery A., Narang S., Devlin J., Bosma M., Mishra G., Roberts A., Barham P., Chung H.W., Sutton C., Gehrmann S., Schuh P., Shi K., Tsvyashchenko S., Maynez J., Rao A., Barnes P., Tay Y., Shazeer N., Prab-hakaran V., Reif E., Du N., Hutchinson B., Pope R., Bradbury J., Austin J., Isard M., Gur-Ari G., Yin P., Duke T., Levskaya A., Ghemawat S., Dev S., Michalewski H., Garcia X., Misra V., Robinson K., Fedus L., Zhou D., Ippolito D., Luan D., Lim H., Zoph B., Spiridonov A., Sepassi R., Dohan D., Agrawal S., Omernick M., Dai A.M., Pillai T.S., Pellat M., Lewkowycz A., Moreira E., Child R., Polozov O., Lee K., Zhou Z., Wang X., Saeta B., Diaz M., Firat O., Catasta M., Wei J., Meier-Hellstern K., Eck D., Dean J., Petrov S., Fiedel N.: PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022) arXiv:2204.02311 [cs.CL]
-
- Wei J., Wang X., Schuurmans D., Bosma M., Ichter B., Xia F., Chi E., Le Q., Zhou D.: Chain-of-thought prompting elicits reasoning in large language models. In: Koyejo S., Mohamed S., Agarwal A., Belgrave D., Cho K., Oh A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837. Curran Associates, Inc., ??? (2022)
-
- Kojima T., Gu S.s., Reid M., Matsuo Y., Iwasawa Y.: Large language models are Zero-Shot reasoners. In: Koyejo S., Mohamed S., Agarwal A., Belgrave D., Cho K., Oh A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213. Curran Associates, Inc., ??? (2022)
-
- Brown T., Mann B., Ryder N., Subbiah M., Kaplan J.D., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A., Agarwal S., Herbert-Voss A., Krueger G., Henighan T., Child R., Ramesh A., Ziegler D., Wu J., Winter C., Hesse C., Chen M., Sigler E., Litwin M., Gray S., Chess B., Clark J., Berner C., McCandlish S., Radford A., Sutskever I., Amodei D.: Language models are Few-Shot learners. In: Larochelle H., Ranzato M., Hadsell R., Balcan M.F., Lin H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc., ??? (2020)
-
- Ouyang L., Wu J., Jiang X., Almeida D., Wainwright C., Mishkin P., Zhang C., Agarwal S., Slama K., Ray A., Schulman J., Hilton J., Kelton F., Miller L., Simens M., Askell A., Welinder P., Christiano P.F., Leike J., Lowe R.: Training language models to follow instructions with human feedback. In: Koyejo S., Mohamed S., Agarwal A., Belgrave D., Cho K., Oh A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744. Curran Associates, Inc., ??? (2022)
Publication types
Grants and funding
LinkOut - more resources
Full Text Sources
Other Literature Sources