NPJ Digit Med. 2023 Aug 24;6(1):158.
doi: 10.1038/s41746-023-00896-7.

Evaluating large language models on medical evidence summarization

Liyan Tang et al. NPJ Digit Med. 2023.

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Performance of different summarization systems in automatic evaluations.
a Reference-based metrics (higher scores indicate better summaries). b Extractiveness metrics.
Fig. 2. Performance of different summarization systems in human evaluations.
a Coherence, b factual consistency, c comprehensiveness, and d harmfulness. Statistical analysis by Mann–Whitney U test, *p-value ≤ 0.05, **p-value ≤ 0.01, ***p-value ≤ 0.001, ****p-value ≤ 0.0001.
Fig. 3. Annotator vote distribution across all clinical domains and models.
a The most and least preferred summaries. b The reasons for choosing the most preferred summaries. c The reasons for choosing the least preferred summaries.
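
The Fig. 2 caption reports Mann–Whitney U tests with star-coded p-value thresholds. As a minimal sketch of how that notation can be read, the Python below maps a p-value to the caption's stars and computes a U statistic for two small rating samples. The function names, the "ns" label for non-significant results, and the toy ratings are assumptions made for illustration; they are not taken from the study, and the code is not the authors' analysis pipeline.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the star notation listed in the Fig. 2 caption."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    # Check thresholds from strictest to loosest so the most stars win.
    thresholds = ((0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*"))
    for threshold, stars in thresholds:
        if p_value <= threshold:
            return stars
    return "ns"  # assumed label for "not significant"; not from the caption


def mann_whitney_u(sample_a, sample_b) -> float:
    """Return the U statistic for sample_a, counting each tie as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical ordinal ratings for two summarization systems (toy data).
ratings_system_1 = [3, 4, 5]
ratings_system_2 = [1, 2, 3]

u_1 = mann_whitney_u(ratings_system_1, ratings_system_2)
u_2 = mann_whitney_u(ratings_system_2, ratings_system_1)
print(u_1, u_2)                     # the two U values sum to len(a) * len(b)
print(significance_stars(0.03))     # one star: 0.01 < p <= 0.05
```

This pairwise-counting form of U is quadratic in sample size and returns no p-value; in practice a statistics library would typically supply both the statistic and the p-value.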
