A strategy for cost-effective large language model use at health system-scale

Eyal Klang et al. NPJ Digit Med. 2024 Nov 18;7(1):320.
doi: 10.1038/s41746-024-01315-1

Abstract

Large language models (LLMs) can optimize clinical workflows; however, the economic and computational challenges of their utilization at the health system scale are underexplored. We evaluated how concatenating queries with multiple clinical notes and tasks simultaneously affects model performance under increasing computational loads. We assessed ten LLMs of different capacities and sizes utilizing real-world patient data. We conducted >300,000 experiments of various task sizes and configurations, measuring accuracy in question-answering and the ability to properly format outputs. Performance deteriorated as the number of questions and notes increased. High-capacity models, like Llama-3-70b, had low failure rates and high accuracies. GPT-4-turbo-128k was similarly resilient across task burdens, but performance deteriorated after 50 tasks at large prompt sizes. After addressing mitigable failures, these two models can concatenate up to 50 simultaneous tasks effectively, with validation on a public medical question-answering dataset. An economic analysis demonstrated up to a 17-fold cost reduction at 50 tasks using concatenation. These results identify the limits of LLMs for effective utilization and highlight avenues for cost-efficiency at the enterprise scale.
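To make the concatenation strategy concrete, the minimal Python sketch below batches several notes and their questions into one prompt and scores the two failure modes the study tracks (malformed JSON and omitted answers). It is illustrative only: call_llm is a hypothetical stand-in for whichever chat-completion client is used, and the prompt wording and JSON schema are assumptions, not the authors' actual protocol.

```python
import json

def build_prompt(notes, questions_per_note):
    """Concatenate several clinical notes and their questions into one
    prompt, asking for a single JSON object keyed by question id."""
    parts = [
        "Answer every question using only the matching note.",
        'Return one JSON object of the form {"Q0": "<answer>", "Q1": ...}.',
    ]
    qid = 0
    for n, note in enumerate(notes):
        parts.append(f"--- NOTE {n} ---\n{note}")
        for question in questions_per_note[n]:
            parts.append(f"Q{qid} (note {n}): {question}")
            qid += 1
    return "\n\n".join(parts), qid

def score_response(raw, n_questions):
    """Classify the two failure modes measured in the study:
    JSON failures (unparseable output) and Omission Failures
    (well-formed output that skips some questions)."""
    try:
        answers = json.loads(raw)
    except json.JSONDecodeError:
        return {"json_failure": True, "omissions": n_questions, "answers": {}}
    if not isinstance(answers, dict):
        return {"json_failure": True, "omissions": n_questions, "answers": {}}
    omissions = sum(1 for i in range(n_questions) if f"Q{i}" not in answers)
    return {"json_failure": False, "omissions": omissions, "answers": answers}

# call_llm(prompt) -> str is a hypothetical stand-in for any chat API:
# prompt, n = build_prompt(notes, questions_per_note)
# result = score_response(call_llm(prompt), n)
```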


Conflict of interest statement

Competing interests: Girish Nadkarni is an Associate Editor for NPJ Digital Medicine. The authors declare no other competing interests.

Figures

Fig. 1. Study design and overview.
We used real-world notes from electronic health records. GPT-4-8k was then used to create bespoke question-answer pairs of three types: fact-based, temporal, and numerical. We then tested ten other LLMs of various sizes against different burdens of questions and notes, assessing question-answering accuracy and proper output formatting.
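As a rough illustration of the caption's generation step, the template below asks a capable model to emit the three question types as JSON. The wording is an assumed placeholder, not the study's actual GPT-4-8k prompt, which is not given in this record.

```python
# Assumed template for the QA-generation step in Fig. 1.
QA_GENERATION_TEMPLATE = """You are given a clinical note.
Write one fact-based, one temporal, and one numerical question,
each answerable from the note alone. Respond as a JSON list:
[{{"type": "...", "question": "...", "answer": "..."}}]

NOTE:
{note}
"""

def qa_generation_prompt(note: str) -> str:
    return QA_GENERATION_TEMPLATE.format(note=note)
```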
Fig. 2. Accuracy of LLMs for question-answer pairs across task sizes with or without Omission Failures.
Overall question-answering accuracy on the clinical notes across question burdens for each task size, including Omission Failures for Small (a), Medium (c), and Large (e) tasks, and excluding Omission Failures for Small (b), Medium (d), and Large (f) tasks. Shaded areas reflect 95% confidence intervals. Models that were unable to properly format responses were excluded.
Fig. 3. Performance of formatting JSON output responses.
Assessment of the LLMs' ability to properly format outputs, aggregated across configurations by task size. Models with high JSON failure rates were not assessed for Omission Failures. JSON loading error failure rates are plotted with 95% confidence intervals across all experiments for Small (a), Medium (b), and Large (c) task sizes; Omission errors are plotted similarly for Small (d), Medium (e), and Large (f) task sizes.
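The failure-rate curves and their 95% confidence intervals can be recomputed from raw per-experiment outcomes with a standard proportion estimate. The record does not state which interval method the paper used; the sketch below assumes the common normal approximation.

```python
from math import sqrt

def rate_with_ci(failures: int, trials: int, z: float = 1.96):
    """Failure rate with a normal-approximation 95% CI (the interval
    method is an assumption; the paper may use a different one)."""
    p = failures / trials
    half = z * sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half), min(1.0, p + half)

# e.g. 12 JSON-loading errors in 300 runs of one configuration:
# rate_with_ci(12, 300)  ->  (0.04, ~0.018, ~0.062)
```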
Fig. 4. Graphical representation of output accuracy for a given experiment, highlighting different types of errors.
Accuracy and errors for GPT-4-turbo-128k on the Medium task (4 notes) with a 15-question burden per note. Omission and JSON failures are shown alongside correct and incorrect answers.
Fig. 5. Impact of prompt size on LLM performance for 50 total tasks.
The impact of prompt size when the number of tasks is held constant at 50 for GPT-3.5-turbo-16k, GPT-4-turbo-128k, Mixtral-8x22B, and Mixtral-8x7B, with 50 iterations per experiment: JSON failure rate (a); Omission failure rate (b); overall accuracy including Omission Failures as errors (c); and overall accuracy excluding Omission Failures as errors (d), aggregated across configurations. As the number of notes increased, the token count rose while questions per note decreased (e.g., 2 notes with 25 questions each, 5 notes with 10 questions each). Shaded areas in (c) and (d) reflect 95% confidence intervals.
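For intuition on where the abstract's up-to-17-fold saving comes from, the toy model below amortizes the shared prompt overhead (instructions and notes) across concatenated tasks instead of repeating it per task. All token counts and the flat per-token pricing are assumptions for illustration only.

```python
def cost_ratio(overhead_tokens: int, per_task_tokens: int, tasks: int) -> float:
    """Input-token cost of one-query-per-task divided by the cost of a
    single concatenated query, assuming a flat price per input token."""
    separate = tasks * (overhead_tokens + per_task_tokens)
    concatenated = overhead_tokens + tasks * per_task_tokens
    return separate / concatenated

# Assumed numbers: ~2,000 tokens of shared instructions/notes and
# ~100 tokens per task, batched 50 at a time:
# cost_ratio(2000, 100, 50)  ->  15.0, the same order as the paper's
# reported up-to-17-fold saving.
```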
