[Preprint]. 2025 Aug 24:2025.08.22.25334049.
doi: 10.1101/2025.08.22.25334049.

Orchestrated multi agents sustain accuracy under clinical-scale workloads compared to a single agent

Eyal Klang et al. medRxiv. 2025.

Abstract

We tested state-of-the-art large language models (LLMs) in two configurations for clinical-scale workloads: a single agent handling heterogeneous tasks versus an orchestrated multi-agent system assigning each task to a dedicated worker. Across retrieval, extraction, and dosing calculations, we varied batch sizes from 5 to 80 to simulate clinical traffic. Multi-agent runs maintained high accuracy under load (pooled accuracy 90.6% at 5 tasks, 65.3% at 80) while single-agent accuracy fell sharply (73.1% to 16.6%), with significant differences beyond 10 tasks (FDR-adjusted p < 0.01). Multi-agent execution reduced token usage up to 65-fold and limited latency growth compared with single-agent runs. The design's isolation of tasks prevented context interference and preserved performance across four diverse LLM checkpoints. This is the first evaluation of LLM agent architectures under sustained, mixed-task clinical workloads, showing that lightweight orchestration can deliver accuracy, efficiency, and auditability at operational scale.
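The contrast between the two configurations can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Task` type, the three worker functions, the `"mg_per_kg,weight_kg"` payload format, and the `orchestrate` and `single_agent_prompt` helpers are all hypothetical, and the workers are plain-Python stand-ins for what would be separate LLM calls. The sketch only shows the routing idea, in which each task reaches a dedicated worker with its own isolated input, whereas the single-agent baseline packs the whole batch into one shared context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    task_id: int
    kind: str      # "retrieval", "extraction", or "dosing" (task families named in the abstract)
    payload: str   # hypothetical task input


# Hypothetical stand-ins for LLM-backed workers. Each one sees a single task only.
def retrieval_worker(task: Task) -> str:
    return f"retrieved:{task.payload}"


def extraction_worker(task: Task) -> str:
    return f"extracted:{task.payload}"


def dosing_worker(task: Task) -> str:
    # Assumed payload format for this sketch: "mg_per_kg,weight_kg".
    mg_per_kg, weight_kg = (float(x) for x in task.payload.split(","))
    return f"dose_mg:{mg_per_kg * weight_kg:.1f}"


WORKERS: Dict[str, Callable[[Task], str]] = {
    "retrieval": retrieval_worker,
    "extraction": extraction_worker,
    "dosing": dosing_worker,
}


def orchestrate(batch: List[Task]) -> Dict[int, str]:
    """Multi-agent style: route each task to its dedicated worker in isolation."""
    results: Dict[int, str] = {}
    for task in batch:
        worker = WORKERS[task.kind]           # KeyError on an unknown task kind
        results[task.task_id] = worker(task)  # worker receives this task only
    return results


def single_agent_prompt(batch: List[Task]) -> str:
    """Single-agent style: every task is concatenated into one shared context."""
    return "\n".join(f"[{t.task_id}] {t.kind}: {t.payload}" for t in batch)


if __name__ == "__main__":
    demo = [
        Task(1, "retrieval", "potassium"),
        Task(2, "extraction", "note A"),
        Task(3, "dosing", "2.5,20"),
    ]
    print(orchestrate(demo))
    print(single_agent_prompt(demo))
```

In this sketch the shared prompt grows with batch size while each worker's input stays fixed, which is one plausible reading of why isolating tasks limits context interference as batches grow from 5 to 80.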

Conflict of interest statement

Competing interest – None declared for all authors.

Figures

Figure 1. Overview of the pipeline design.
Figure 2. Accuracy and token usage: top, the GPT-4.1 mini model; bottom, pooled data across the 4 models.
