Res Sq [Preprint]. 2023 Oct 30:rs.3.rs-3483777. doi: 10.21203/rs.3.rs-3483777/v1.

Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts

Dave Van Veen et al. Res Sq. 2023.

Abstract

Sifting through vast textual data and summarizing key information from electronic health records (EHR) imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown immense promise in natural language processing (NLP) tasks, their efficacy on a diverse range of clinical summarization tasks has not yet been rigorously demonstrated. In this work, we apply domain adaptation methods to eight LLMs, spanning six datasets and four distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods in addition to instances where recent advances in LLMs may not improve results. Further, in a clinical reader study with ten physicians, we show that summaries from our best-adapted LLMs are preferable to human summaries in terms of completeness and correctness. Our ensuing qualitative analysis highlights challenges faced by both LLMs and human experts. Lastly, we correlate traditional quantitative NLP metrics with reader study scores to enhance our understanding of how these metrics align with physician preferences. Our research marks the first evidence of LLMs outperforming human experts in clinical text summarization across multiple tasks. This implies that integrating LLMs into clinical workflows could alleviate documentation burden, empowering clinicians to focus more on personalized patient care and the inherently human aspects of medicine.
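
For readers unfamiliar with the adaptation methods referenced above, the following minimal sketch illustrates how an in-context learning (ICL) prompt for clinical summarization might be assembled; the instruction wording, example texts, and function name are illustrative assumptions rather than the authors' exact prompts (the actual prompt anatomy and per-task instructions appear in Figure 2 and Table A1).

    # Minimal sketch of in-context-learning (ICL) prompt assembly for clinical
    # summarization. The instruction text and examples are hypothetical; the
    # paper's actual per-task instructions are given in Table A1.

    def build_icl_prompt(instruction: str,
                         examples: list[tuple[str, str]],
                         query: str) -> str:
        """Prepend m in-context (input, summary) pairs to the target input."""
        parts = [instruction]
        for source_text, summary in examples:          # m in-context examples
            parts.append(f"Input:\n{source_text}\nSummary:\n{summary}")
        parts.append(f"Input:\n{query}\nSummary:")     # model completes this
        return "\n\n".join(parts)

    # Hypothetical usage with m = 1 example for radiology report summarization.
    prompt = build_icl_prompt(
        instruction="Summarize the radiology report findings into an impression.",
        examples=[("FINDINGS: No acute cardiopulmonary abnormality.",
                   "IMPRESSION: Normal chest radiograph.")],
        query="FINDINGS: Mild cardiomegaly. Lungs are clear.",
    )
    print(prompt)

QLoRA, the other adaptation method evaluated, instead fine-tunes a quantized model with low-rank adapters rather than modifying the prompt.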

Figures

Figure A1 | Comparing Llama-2 (7B) vs. Llama-2 (13B). The dashed line denotes equivalence, and each data point corresponds to the average score of s = 250 samples for a given experimental configuration, i.e. {dataset × m in-context examples}.
Figure A2 | Summarization performance comparing one in-context example (ICL) vs. QLoRA across all open-source models on patient health questions. Figure 3b contains similar results with the Open-i radiology report dataset.
Figure A3 | Metric scores vs. number of in-context examples across models and datasets. We also include the best model fine-tuned with QLoRA (FLAN-T5) as a horizontal dashed line.
Figure A4 | Annotation of two patient health question examples from the reader study. The table (lower left) contains reader scores for these two examples and the task average across all samples.
Figure A5 | Annotation of a progress notes summarization example evaluated in the reader study. The table (lower right) contains reader scores for this example and the task average across all samples.
Figure A6 | Annotation of a progress notes summarization example evaluated in the reader study. The table (lower right) contains reader scores for this example and the task average across all samples.
Figure A7 | Example of the doctor-patient dialogue summarization task, including “assessment and plan” sections generated by both a human expert and GPT-4.
Figure 1 | Overview. First we quantitatively evaluate each valid combination (×) of LLM and adaptation method across four distinct summarization tasks comprising six datasets. We then conduct a clinical reader study in which ten physicians compare summaries of the best model/method against those of a human expert.
Figure 2 | Prompt anatomy. Each summarization task uses a slightly different instruction, as depicted in Table A1.
Figure 3 | Quantitative results. (a) Alpaca vs. Med-Alpaca. Each data point corresponds to one experimental configuration, and the dashed lines denote equal performance. (b) One in-context example (ICL) vs. QLoRA methods across all open-source models on the Open-i radiology report dataset. (c) MEDCON scores vs. number of in-context examples across models and datasets. We also include the best model fine-tuned with QLoRA as a horizontal dashed line for valid datasets. See Figure A3 for results across all four metrics. (d) Model win rate: a head-to-head winning percentage of each model combination, where red/blue intensities highlight the degree to which models on the vertical axis outperform models on the horizontal axis.
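
As a rough illustration of the win-rate matrix in Figure 3d, the sketch below computes, for each ordered pair of models, the fraction of samples on which one model's metric score exceeds the other's; the model names and scores are placeholders, not the study's data.

    # Sketch of a head-to-head "win rate" matrix: for each pair of models, the
    # fraction of samples on which the row model's metric score exceeds the
    # column model's. Scores below are random placeholders, not study results.
    import numpy as np

    rng = np.random.default_rng(0)
    models = ["FLAN-T5", "Llama-2", "GPT-3.5", "GPT-4"]          # hypothetical subset
    scores = {m: rng.random(250) for m in models}                # per-sample metric scores

    win_rate = np.zeros((len(models), len(models)))
    for i, row_model in enumerate(models):
        for j, col_model in enumerate(models):
            if i != j:
                win_rate[i, j] = np.mean(scores[row_model] > scores[col_model])

    print(np.round(win_rate, 2))
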
Figure 4 | Clinical reader study. (a) Study design comparing the summarization of GPT-4 vs. that of human experts on three attributes: completeness, correctness, and conciseness. (b) Results. Highlight colors correspond to a value’s location on the color spectrum. Asterisks denote statistical significance by Wilcoxon signed-rank test, *p-value < 0.001. (c) Reader study user interface. (d) Distribution of reader scores for each summarization task across attributes. Horizontal axes denote reader preference as measured by a five-point Likert scale. Vertical axes denote frequency count, with 1,500 total reports for each plot.
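
The significance marker in Figure 4b refers to a Wilcoxon signed-rank test on paired reader scores; the minimal sketch below uses placeholder data, not the study's reader ratings.

    # Sketch of the paired significance test named in Figure 4b: a Wilcoxon
    # signed-rank test comparing reader scores for LLM vs. human summaries.
    # The score arrays are placeholders, not the study's actual reader data.
    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    llm_scores = rng.integers(3, 6, size=300)     # hypothetical Likert scores, GPT-4 summaries
    human_scores = rng.integers(2, 6, size=300)   # paired scores, human summaries

    stat, p_value = wilcoxon(llm_scores, human_scores)
    print(f"Wilcoxon statistic={stat:.1f}, p={p_value:.4g}")
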
Figure 5 | Annotation of two radiology report examples from the reader study. The table (lower left) contains reader scores for these two examples and the task average across all samples.
Figure 6 | Spearman correlation coefficients between NLP metrics and reader preference assessing completeness, correctness, and conciseness.
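
The correlations in Figure 6 are Spearman rank correlations between automated metric scores and reader ratings; the sketch below shows one such computation on placeholder values (the metric name and all data are assumptions for illustration).

    # Sketch of the metric-vs-reader correlation in Figure 6: Spearman's rank
    # correlation between an automated NLP metric (e.g., MEDCON) and reader
    # preference scores. All values below are placeholders.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    metric_scores = rng.random(250)               # per-sample metric values
    reader_scores = rng.integers(1, 6, size=250)  # paired five-point Likert ratings

    rho, p_value = spearmanr(metric_scores, reader_scores)
    print(f"Spearman rho={rho:.2f}, p={p_value:.3g}")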
