NPJ Digit Med. 2025 Dec 26. doi: 10.1038/s41746-025-02277-8. Online ahead of print.

A novel evaluation benchmark for medical LLMs illuminating safety and effectiveness in clinical domains

Shirui Wang et al. NPJ Digit Med. 2025.

Abstract

Large language models (LLMs) hold promise for clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus. It comprises 30 consequence-weighted metrics covering critical areas such as critical illness recognition, guideline adherence, and medication safety. Thirty-two specialist physicians developed and revised 2069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%; safety 54.7%; effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed a consistent advantage over general-purpose models, achieving the highest scores in both safety (0.912) and effectiveness (0.861). These findings provide a standardized metric for evaluating medical LLMs in clinical use, supporting comparative analysis, identification of risk exposure, and targeted improvement across scenarios, and they may promote safer, more effective deployment of LLMs in healthcare environments.
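For intuition only: the abstract describes two score tracks (safety and effectiveness) built from consequence-weighted metrics, but does not publish the exact scoring formula. A dual-track weighted aggregation could look like the minimal sketch below; the metric names, weights, and per-metric scores are hypothetical placeholders, not values from the study.

```python
# Minimal sketch of dual-track, consequence-weighted score aggregation.
# Assumption: each metric has a track label, an expert-assigned weight,
# and a model score in [0, 1]. All concrete values below are illustrative.

from dataclasses import dataclass

@dataclass
class MetricScore:
    name: str      # metric label, e.g. "critical illness recognition" (hypothetical)
    track: str     # "safety" or "effectiveness"
    weight: float  # consequence weight from expert consensus (hypothetical)
    score: float   # model score on this metric, in [0, 1] (hypothetical)

def track_score(results: list[MetricScore], track: str) -> float:
    """Weighted average of metric scores within one track."""
    items = [r for r in results if r.track == track]
    total_weight = sum(r.weight for r in items)
    return sum(r.weight * r.score for r in items) / total_weight

results = [
    MetricScore("critical illness recognition", "safety", 3.0, 0.70),
    MetricScore("medication safety", "safety", 2.0, 0.55),
    MetricScore("guideline adherence", "effectiveness", 1.0, 0.80),
]
print(f"safety: {track_score(results, 'safety'):.3f}")
print(f"effectiveness: {track_score(results, 'effectiveness'):.3f}")
```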


Conflict of interest statement

Competing interests: SW, TG, YW, WS, ZL, KM, DY, HG and LM are employees of Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China, the developers of the MedGPT model evaluated in this study. These authors contributed to the study concept only. The other authors declare no competing interests.

