Res Sq [Preprint]. 2023 Oct 30:rs.3.rs-3483777. doi: 10.21203/rs.3.rs-3483777/v1.

Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts

Dave Van Veen et al. Res Sq. 2023.

Abstract

Sifting through vast textual data and summarizing key information from electronic health records (EHR) imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown immense promise in natural language processing (NLP) tasks, their efficacy on a diverse range of clinical summarization tasks has not yet been rigorously demonstrated. In this work, we apply domain adaptation methods to eight LLMs, spanning six datasets and four distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods in addition to instances where recent advances in LLMs may not improve results. Further, in a clinical reader study with ten physicians, we show that summaries from our best-adapted LLMs are preferable to human summaries in terms of completeness and correctness. Our ensuing qualitative analysis highlights challenges faced by both LLMs and human experts. Lastly, we correlate traditional quantitative NLP metrics with reader study scores to enhance our understanding of how these metrics align with physician preferences. Our research marks the first evidence of LLMs outperforming human experts in clinical text summarization across multiple tasks. This implies that integrating LLMs into clinical workflows could alleviate documentation burden, empowering clinicians to focus more on personalized patient care and the inherently human aspects of medicine.
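
For readers unfamiliar with the adaptation methods referenced above, the following minimal sketch illustrates how an in-context learning (ICL) prompt for clinical summarization might be assembled; the instruction wording, example texts, and function name are illustrative assumptions rather than the authors' exact prompts (the actual prompt anatomy and per-task instructions appear in Figure 2 and Table A1).

    # Minimal sketch of in-context-learning (ICL) prompt assembly for clinical
    # summarization. The instruction text and examples are hypothetical; the
    # paper's actual per-task instructions are given in Table A1.

    def build_icl_prompt(instruction: str,
                         examples: list[tuple[str, str]],
                         query: str) -> str:
        """Prepend m in-context (input, summary) pairs to the target input."""
        parts = [instruction]
        for source_text, summary in examples:          # m in-context examples
            parts.append(f"Input:\n{source_text}\nSummary:\n{summary}")
        parts.append(f"Input:\n{query}\nSummary:")     # model completes this
        return "\n\n".join(parts)

    # Hypothetical usage with m = 1 example for radiology report summarization.
    prompt = build_icl_prompt(
        instruction="Summarize the radiology report findings into an impression.",
        examples=[("FINDINGS: No acute cardiopulmonary abnormality.",
                   "IMPRESSION: Normal chest radiograph.")],
        query="FINDINGS: Mild cardiomegaly. Lungs are clear.",
    )
    print(prompt)

QLoRA, the other adaptation method evaluated, instead fine-tunes a quantized model with low-rank adapters rather than modifying the prompt.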

Figures

Figure A1 | Comparing Llama-2 (7B) vs. Llama-2 (13B). The dashed line denotes equivalence, and each data point corresponds to the average score of s = 250 samples for a given experimental configuration, i.e. {dataset × m in-context examples}.
Figure A2 | Summarization performance comparing one in-context example (ICL) vs. QLoRA across all open-source models on patient health questions. Figure 3b contains similar results with the Open-i radiology report dataset.
Figure A3 | Metric scores vs. number of in-context examples across models and datasets. We also include the best model fine-tuned with QLoRA (FLAN-T5) as a horizontal dashed line.
Figure A4 | Annotation of two patient health question examples from the reader study. The table (lower left) contains reader scores for these two examples and the task average across all samples.
Figure A5 | Annotation of a progress notes summarization example evaluated in the reader study. The table (lower right) contains reader scores for this example and the task average across all samples.
Figure A6 | Annotation of a progress notes summarization example evaluated in the reader study. The table (lower right) contains reader scores for this example and the task average across all samples.
Figure A7 | Example of the doctor-patient dialogue summarization task, including “assessment and plan” sections generated by both a human expert and GPT-4.
Figure 1 | Overview. First we quantitatively evaluate each valid combination (×) of LLM and adaptation method across four distinct summarization tasks comprising six datasets. We then conduct a clinical reader study in which ten physicians compare summaries of the best model/method against those of a human expert.
Figure 2 | Prompt anatomy. Each summarization task uses a slightly different instruction, as depicted in Table A1.
Figure 3 | Quantitative results. (a) Alpaca vs. Med-Alpaca. Each data point corresponds to one experimental configuration, and the dashed lines denote equal performance. (b) One in-context example (ICL) vs. QLoRA methods across all open-source models on the Open-i radiology report dataset. (c) MEDCON scores vs. number of in-context examples across models and datasets. We also include the best model fine-tuned with QLoRA as a horizontal dashed line for valid datasets. See Figure A3 for results across all four metrics. (d) Model win rate: a head-to-head winning percentage of each model combination, where red/blue intensities highlight the degree to which models on the vertical axis outperform models on the horizontal axis.
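
As a rough illustration of the win-rate matrix in Figure 3d, the sketch below computes, for each ordered pair of models, the fraction of samples on which one model's metric score exceeds the other's; the model names and scores are placeholders, not the study's data.

    # Sketch of a head-to-head "win rate" matrix: for each pair of models, the
    # fraction of samples on which the row model's metric score exceeds the
    # column model's. Scores below are random placeholders, not study results.
    import numpy as np

    rng = np.random.default_rng(0)
    models = ["FLAN-T5", "Llama-2", "GPT-3.5", "GPT-4"]          # hypothetical subset
    scores = {m: rng.random(250) for m in models}                # per-sample metric scores

    win_rate = np.zeros((len(models), len(models)))
    for i, row_model in enumerate(models):
        for j, col_model in enumerate(models):
            if i != j:
                win_rate[i, j] = np.mean(scores[row_model] > scores[col_model])

    print(np.round(win_rate, 2))
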
Figure 4 | Clinical reader study. (a) Study design comparing the summarization of GPT-4 vs. that of human experts on three attributes: completeness, correctness, and conciseness. (b) Results. Highlight colors correspond to a value’s location on the color spectrum. Asterisks denote statistical significance by Wilcoxon signed-rank test, *p-value < 0.001. (c) Reader study user interface. (d) Distribution of reader scores for each summarization task across attributes. Horizontal axes denote reader preference as measured by a five-point Likert scale. Vertical axes denote frequency count, with 1,500 total reports for each plot.
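
The significance marker in Figure 4b refers to a Wilcoxon signed-rank test on paired reader scores; the minimal sketch below uses placeholder data, not the study's reader ratings.

    # Sketch of the paired significance test named in Figure 4b: a Wilcoxon
    # signed-rank test comparing reader scores for LLM vs. human summaries.
    # The score arrays are placeholders, not the study's actual reader data.
    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    llm_scores = rng.integers(3, 6, size=300)     # hypothetical Likert scores, GPT-4 summaries
    human_scores = rng.integers(2, 6, size=300)   # paired scores, human summaries

    stat, p_value = wilcoxon(llm_scores, human_scores)
    print(f"Wilcoxon statistic={stat:.1f}, p={p_value:.4g}")
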
Figure 5 | Annotation of two radiology report examples from the reader study. The table (lower left) contains reader scores for these two examples and the task average across all samples.
Figure 6 | Spearman correlation coefficients between NLP metrics and reader preference assessing completeness, correctness, and conciseness.
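
The correlations in Figure 6 are Spearman rank correlations between automated metric scores and reader ratings; the sketch below shows one such computation on placeholder values (the metric name and all data are assumptions for illustration).

    # Sketch of the metric-vs-reader correlation in Figure 6: Spearman's rank
    # correlation between an automated NLP metric (e.g., MEDCON) and reader
    # preference scores. All values below are placeholders.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    metric_scores = rng.random(250)               # per-sample metric values
    reader_scores = rng.integers(1, 6, size=250)  # paired five-point Likert ratings

    rho, p_value = spearmanr(metric_scores, reader_scores)
    print(f"Spearman rho={rho:.2f}, p={p_value:.3g}")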
