NPJ Digit Med. 2025 Dec 26. doi: 10.1038/s41746-025-02277-8. Online ahead of print.

A novel evaluation benchmark for medical LLMs illuminating safety and effectiveness in clinical domains

Shirui Wang et al. NPJ Digit Med. 2025.

Abstract

Large language models (LLMs) hold promise for clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus. It comprises 30 consequence-weighted metrics covering critical areas such as critical illness recognition, guideline adherence, and medication safety. Thirty-two specialist physicians developed and revised 2069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%; safety 54.7%; effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed a consistent advantage over general-purpose models, achieving the highest scores in both safety (0.912) and effectiveness (0.861). These findings provide a standardized metric for evaluating medical LLMs in clinical use, supporting comparative analysis, identification of risk exposure, and targeted improvement across scenarios, and they may promote safer, more effective deployment of LLMs in healthcare environments.
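For intuition only: the abstract describes two score tracks (safety and effectiveness) built from consequence-weighted metrics, but does not publish the exact scoring formula. A dual-track weighted aggregation could look like the minimal sketch below; the metric names, weights, and per-metric scores are hypothetical placeholders, not values from the study.

```python
# Minimal sketch of dual-track, consequence-weighted score aggregation.
# Assumption: each metric has a track label, an expert-assigned weight,
# and a model score in [0, 1]. All concrete values below are illustrative.

from dataclasses import dataclass

@dataclass
class MetricScore:
    name: str      # metric label, e.g. "critical illness recognition" (hypothetical)
    track: str     # "safety" or "effectiveness"
    weight: float  # consequence weight from expert consensus (hypothetical)
    score: float   # model score on this metric, in [0, 1] (hypothetical)

def track_score(results: list[MetricScore], track: str) -> float:
    """Weighted average of metric scores within one track."""
    items = [r for r in results if r.track == track]
    total_weight = sum(r.weight for r in items)
    return sum(r.weight * r.score for r in items) / total_weight

results = [
    MetricScore("critical illness recognition", "safety", 3.0, 0.70),
    MetricScore("medication safety", "safety", 2.0, 0.55),
    MetricScore("guideline adherence", "effectiveness", 1.0, 0.80),
]
print(f"safety: {track_score(results, 'safety'):.3f}")
print(f"effectiveness: {track_score(results, 'effectiveness'):.3f}")
```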


Conflict of interest statement

Competing interests: SW, TG, YW, WS, ZL, KM, DY, HG and LM are employees of Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China, the developers of the MedGPT model evaluated in this study. These authors contributed to the study concept only. The other authors declare no competing interests.

