This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Aug 24:2025.08.22.25334049.

doi: 10.1101/2025.08.22.25334049.

Orchestrated multi agents sustain accuracy under clinical-scale workloads compared to a single agent

Eyal Klang¹,²,³, Mahmud Omar¹,²,³, Ganesh Raut¹, Reem Agbareia⁴, Prem Timsina¹, Robert Freeman¹, Nicholas Gavin¹, Lisa Stump¹, Alexander W Charney¹, Benjamin S Glicksberg¹,⁵, Girish N Nadkarni¹,²,³

Affiliations

¹ The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA.
² The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA.
³ The Hasso Plattner Institute of Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
⁴ Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel.
⁵ Mindich Child Health and Development Institute and the Departments of Pediatrics and Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.

PMID: 40894146
PMCID: PMC12393657
DOI: 10.1101/2025.08.22.25334049

Eyal Klang et al. medRxiv. 2025.

Abstract

We tested state-of-the-art large language models (LLMs) in two configurations for clinical-scale workloads: a single agent handling heterogeneous tasks versus an orchestrated multi-agent system assigning each task to a dedicated worker. Across retrieval, extraction, and dosing calculations, we varied batch sizes from 5 to 80 to simulate clinical traffic. Multi-agent runs maintained high accuracy under load (pooled accuracy 90.6% at 5 tasks, 65.3% at 80) while single-agent accuracy fell sharply (73.1% to 16.6%), with significant differences beyond 10 tasks (FDR-adjusted p < 0.01). Multi-agent execution reduced token usage up to 65-fold and limited latency growth compared with single-agent runs. The design's isolation of tasks prevented context interference and preserved performance across four diverse LLM checkpoints. This is the first evaluation of LLM agent architectures under sustained, mixed-task clinical workloads, showing that lightweight orchestration can deliver accuracy, efficiency, and auditability at operational scale.
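The two configurations compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the worker functions are stubs standing in for LLM calls, and the names `Task`, `WORKERS`, `orchestrate`, `single_agent_prompt`, and `max_context_chars`, as well as the payload formats, are hypothetical. It shows only the structural difference: an orchestrator passes each task to a dedicated worker that sees nothing else, whereas a single agent receives the whole batch in one shared context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    kind: str     # "retrieval", "extraction", or "dosing" (task families named in the abstract)
    payload: str  # task text; the contents used here are invented for illustration


def retrieval_worker(payload: str) -> str:
    # Stub standing in for an LLM call dedicated to retrieval.
    return f"retrieved:{payload}"


def extraction_worker(payload: str) -> str:
    # Stub standing in for an LLM call dedicated to extraction.
    return f"extracted:{payload}"


def dosing_worker(payload: str) -> str:
    # Hypothetical weight-based dose; payload format "mg_per_kg,weight_kg" is assumed.
    mg_per_kg, weight_kg = (float(x) for x in payload.split(","))
    return f"dose_mg:{mg_per_kg * weight_kg:.1f}"


# Assumed routing table: one dedicated worker per task kind.
WORKERS: Dict[str, Callable[[str], str]] = {
    "retrieval": retrieval_worker,
    "extraction": extraction_worker,
    "dosing": dosing_worker,
}


def orchestrate(tasks: List[Task]) -> List[str]:
    # Each worker receives only its own payload, so tasks cannot interfere.
    return [WORKERS[t.kind](t.payload) for t in tasks]


def single_agent_prompt(tasks: List[Task]) -> str:
    # Single-agent baseline: every task in the batch shares one prompt.
    return "\n".join(f"[{t.kind}] {t.payload}" for t in tasks)


def max_context_chars(tasks: List[Task], orchestrated: bool) -> int:
    # Largest context any single call must hold, measured in characters.
    if orchestrated:
        return max(len(t.payload) for t in tasks)
    return len(single_agent_prompt(tasks))
```

In this toy setup the largest per-call context stays constant as the batch grows under orchestration, while the single shared prompt grows with every added task. That mirrors the isolation argument in the abstract but does not reproduce its accuracy, token, or latency measurements.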

Conflict of interest statement

Competing interests: None declared for all authors.

Figures

**Figure 1.**
Overview of the pipeline design.

**Figure 2.**
Accuracy and token usage. Top: the GPT-4.1 mini model. Bottom: pooled data across the 4 models.
