A strategy for cost-effective large language model use at health system-scale

Eyal Klang et al. NPJ Digit Med. 2024 Nov 18;7(1):320.
doi: 10.1038/s41746-024-01315-1

Abstract

Large language models (LLMs) can optimize clinical workflows; however, the economic and computational challenges of their utilization at the health system scale are underexplored. We evaluated how concatenating queries with multiple clinical notes and tasks simultaneously affects model performance under increasing computational loads. We assessed ten LLMs of different capacities and sizes utilizing real-world patient data. We conducted >300,000 experiments of various task sizes and configurations, measuring accuracy in question-answering and the ability to properly format outputs. Performance deteriorated as the number of questions and notes increased. High-capacity models, like Llama-3-70b, had low failure rates and high accuracies. GPT-4-turbo-128k was similarly resilient across task burdens, but performance deteriorated after 50 tasks at large prompt sizes. After addressing mitigable failures, these two models can concatenate up to 50 simultaneous tasks effectively, with validation on a public medical question-answering dataset. An economic analysis demonstrated up to a 17-fold cost reduction at 50 tasks using concatenation. These results identify the limits of LLMs for effective utilization and highlight avenues for cost-efficiency at the enterprise scale.
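To make the concatenation strategy concrete, the minimal Python sketch below batches several notes and their questions into one prompt and scores the two failure modes the study tracks (malformed JSON and omitted answers). It is illustrative only: call_llm is a hypothetical stand-in for whichever chat-completion client is used, and the prompt wording and JSON schema are assumptions, not the authors' actual protocol.

```python
import json

def build_prompt(notes, questions_per_note):
    """Concatenate several clinical notes and their questions into one
    prompt, asking for a single JSON object keyed by question id."""
    parts = [
        "Answer every question using only the matching note.",
        'Return one JSON object of the form {"Q0": "<answer>", "Q1": ...}.',
    ]
    qid = 0
    for n, note in enumerate(notes):
        parts.append(f"--- NOTE {n} ---\n{note}")
        for question in questions_per_note[n]:
            parts.append(f"Q{qid} (note {n}): {question}")
            qid += 1
    return "\n\n".join(parts), qid

def score_response(raw, n_questions):
    """Classify the two failure modes measured in the study:
    JSON failures (unparseable output) and Omission Failures
    (well-formed output that skips some questions)."""
    try:
        answers = json.loads(raw)
    except json.JSONDecodeError:
        return {"json_failure": True, "omissions": n_questions, "answers": {}}
    if not isinstance(answers, dict):
        return {"json_failure": True, "omissions": n_questions, "answers": {}}
    omissions = sum(1 for i in range(n_questions) if f"Q{i}" not in answers)
    return {"json_failure": False, "omissions": omissions, "answers": answers}

# call_llm(prompt) -> str is a hypothetical stand-in for any chat API:
# prompt, n = build_prompt(notes, questions_per_note)
# result = score_response(call_llm(prompt), n)
```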


Conflict of interest statement

Competing interests: Girish Nadkarni is an Associate Editor for NPJ Digital Medicine. The authors declare no other competing interests.

Figures

Fig. 1. Study design and overview.
We used real-world notes from electronic health records. GPT-4-8k was then used to create bespoke question-answer pairs of three types: fact-based, temporal, and numerical. We then tested ten other LLMs of various sizes against different burdens of questions and notes, assessing question-answering accuracy and proper output formatting.
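As a rough illustration of the caption's generation step, the template below asks a capable model to emit the three question types as JSON. The wording is an assumed placeholder, not the study's actual GPT-4-8k prompt, which is not given in this record.

```python
# Assumed template for the QA-generation step in Fig. 1.
QA_GENERATION_TEMPLATE = """You are given a clinical note.
Write one fact-based, one temporal, and one numerical question,
each answerable from the note alone. Respond as a JSON list:
[{{"type": "...", "question": "...", "answer": "..."}}]

NOTE:
{note}
"""

def qa_generation_prompt(note: str) -> str:
    return QA_GENERATION_TEMPLATE.format(note=note)
```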
Fig. 2. Accuracy of LLMs for question-answer pairs across task sizes with or without Omission Failures.
Overall question-answering accuracy on the clinical notes across question burdens for each task size, including Omission Failures for Small (a), Medium (c), and Large (e) tasks, and excluding Omission Failures for Small (b), Medium (d), and Large (f) tasks. Shaded areas reflect 95% confidence intervals. Models that were unable to properly format responses were excluded.
Fig. 3. Performance of formatting JSON output responses.
Assessment of the LLMs' ability to properly format outputs, aggregated across configurations by task size. Models with high JSON failure rates were not assessed for Omission Failures. JSON loading error failure rates are plotted with 95% confidence intervals across all experiments for Small (a), Medium (b), and Large (c) task sizes; Omission errors are plotted similarly for Small (d), Medium (e), and Large (f) task sizes.
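The failure-rate curves and their 95% confidence intervals can be recomputed from raw per-experiment outcomes with a standard proportion estimate. The record does not state which interval method the paper used; the sketch below assumes the common normal approximation.

```python
from math import sqrt

def rate_with_ci(failures: int, trials: int, z: float = 1.96):
    """Failure rate with a normal-approximation 95% CI (the interval
    method is an assumption; the paper may use a different one)."""
    p = failures / trials
    half = z * sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half), min(1.0, p + half)

# e.g. 12 JSON-loading errors in 300 runs of one configuration:
# rate_with_ci(12, 300)  ->  (0.04, ~0.018, ~0.062)
```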
Fig. 4. Graphical representation of output accuracy for a given experiment, highlighting different types of errors.
Accuracy and errors for GPT-4-turbo-128k on the Medium task (4 notes) with a 15-question burden per note. Omission and JSON failures are shown alongside correct and incorrect answers.
Fig. 5. Impact of prompt size on LLM performance for 50 total tasks.
The impact of prompt size when the number of tasks is held constant at 50 for GPT-3.5-turbo-16k, GPT-4-turbo-128k, Mixtral-8x22B, and Mixtral-8x7B, with 50 iterations per experiment: JSON failure rate (a); Omission failure rate (b); overall accuracy including Omission Failures as errors (c); and overall accuracy excluding Omission Failures as errors (d), aggregated across configurations. As the number of notes increased, the token count rose while questions per note decreased (e.g., 2 notes with 25 questions each, 5 notes with 10 questions each). Shaded areas in (c) and (d) reflect 95% confidence intervals.
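For intuition on where the abstract's up-to-17-fold saving comes from, the toy model below amortizes the shared prompt overhead (instructions and notes) across concatenated tasks instead of repeating it per task. All token counts and the flat per-token pricing are assumptions for illustration only.

```python
def cost_ratio(overhead_tokens: int, per_task_tokens: int, tasks: int) -> float:
    """Input-token cost of one-query-per-task divided by the cost of a
    single concatenated query, assuming a flat price per input token."""
    separate = tasks * (overhead_tokens + per_task_tokens)
    concatenated = overhead_tokens + tasks * per_task_tokens
    return separate / concatenated

# Assumed numbers: ~2,000 tokens of shared instructions/notes and
# ~100 tokens per task, batched 50 at a time:
# cost_ratio(2000, 100, 50)  ->  15.0, the same order as the paper's
# reported up-to-17-fold saving.
```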
