Nat Med. 2025 Aug;31(8):2546-2549. doi: 10.1038/s41591-025-03727-2. Epub 2025 Apr 23.

Benchmark evaluation of DeepSeek large language models in clinical decision-making

Sarah Sandmann et al. Nat Med. 2025 Aug.

Abstract

Large language models (LLMs) are increasingly transforming medical applications. However, proprietary models such as GPT-4o face significant barriers to clinical adoption because they cannot be deployed on site within healthcare institutions, making them noncompliant with stringent privacy regulations. Recent advances in open-source LLMs such as the DeepSeek models offer a promising alternative because they allow efficient fine-tuning on local data in hospitals with advanced information technology infrastructure. Here, to demonstrate the clinical utility of DeepSeek-V3 and DeepSeek-R1, we benchmarked their performance on clinical decision support tasks against proprietary LLMs, including GPT-4o and Gemini-2.0 Flash Thinking Experimental. Using 125 patient cases with sufficient statistical power, covering a broad range of frequent and rare diseases, we found that the DeepSeek models perform as well as, and in some cases better than, the proprietary LLMs. Our study demonstrates that open-source LLMs can provide a scalable pathway for secure model training, enabling real-world medical applications in accordance with data privacy and healthcare regulations.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Model performance for diagnosis tasks.
a–d, Bubble plots showing the results of the 125 pairwise comparisons on a 5-point Likert scale for GPT-4o versus DeepSeek-R1 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 0.3085, V = 378, 95% CI −3.13 × 10⁻⁷ to infinity, estimate 0.25) (a); GPT-4o versus Gemini-2.0 Flash Thinking Experimental (Gem2FTE) (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 7.89 × 10⁻⁶, V = 1,576, 95% CI 0.5 to infinity, estimate 0.75) (b); DeepSeek-R1 versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 5.73 × 10⁻⁵, V = 1,515, 95% CI 0.5 to infinity, estimate 0.5) (c); and DeepSeek-R1 versus DeepSeek-V3 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 1, V = 307, 95% CI −0.25 to infinity, estimate 1.97 × 10⁻⁵) (d). e, Violin plots comparing the Likert scores of GPT-4o, DeepSeek-R1, DeepSeek-V3 and Gem2FTE with those of GPT-4, GPT-3.5 and Google in our previous study (n.s., not significant; ***P < 0.001; significance levels visualize the results of the statistical tests performed in a–d). An explorative comparison of the n = 110 cases analyzed by all seven models with the n = 15 newly added cases shows that the performance scores align well (one-sided unpaired Mann–Whitney test, alternative = greater; GPT-4o: P = 0.5441, W = 813.5, 95% CI −1.84 × 10⁻⁵ to infinity, estimate −4.99 × 10⁻⁵; DeepSeek-R1: P = 0.7710, W = 740, 95% CI 3.75 × 10⁻⁵ to infinity, estimate −2.16 × 10⁻⁵; DeepSeek-V3: P = 0.6678, W = 775.5, 95% CI −7.45 × 10⁻⁵ to infinity, estimate 5.91 × 10⁻⁵; Gem2FTE: P = 0.9899, W = 540, 95% CI −0.5 to infinity, estimate −3.51 × 10⁻⁵). f, Cumulative frequency of the Likert scores for GPT-4o, DeepSeek-R1, DeepSeek-V3, Gem2FTE and GPT-4.
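The V and W statistics reported above appear to follow the conventions of R's wilcox.test (V for the paired test, W for the unpaired test). As a minimal sketch of the same kind of analysis, assuming SciPy as the tooling and using hypothetical placeholder scores rather than the study's data, a one-sided paired Mann–Whitney (Wilcoxon signed-rank) test with continuity correction and a Bonferroni adjustment over k = 4 comparisons could be run as follows:

    # Sketch only: paired, one-sided Wilcoxon signed-rank test with continuity
    # correction and Bonferroni adjustment, as described in the Fig. 1 caption.
    # The Likert-score arrays are hypothetical placeholders, not the study data.
    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    scores_model_a = rng.integers(1, 6, size=125)  # 5-point Likert scores, model A
    scores_model_b = rng.integers(1, 6, size=125)  # 5-point Likert scores, model B

    k = 4  # number of pairwise diagnosis comparisons (Fig. 1a-d)
    stat, p = wilcoxon(scores_model_a, scores_model_b,
                       alternative="greater",   # one-sided: model A rated higher
                       correction=True,         # continuity correction
                       zero_method="wilcox")    # drop tied (zero-difference) cases
    p_adjusted = min(1.0, p * k)                # Bonferroni correction
    print(f"statistic = {stat}, adjusted P = {p_adjusted:.4f}")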
Fig. 2. Model performance for treatment recommendation tasks.
a–c, Bubble plots showing the results of the 125 pairwise comparisons on a 5-point Likert scale for GPT-4o versus DeepSeek-R1 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.1522, V = 771.5, 95% CI −6.88 × 10⁻⁵ to infinity, estimate 0.25) (a); GPT-4o versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.0016, V = 1,154, 95% CI 0.2501 to infinity, estimate 0.5) (b); and DeepSeek-R1 versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.0235, V = 1,124, 95% CI 4.21 × 10⁻⁶ to infinity, estimate 0.5) (c). d, Violin plots comparing the Likert scores of GPT-4o, DeepSeek-R1 and Gem2FTE with those of GPT-4 and GPT-3.5 (n.s., not significant; *P < 0.05; significance levels visualize the results of the statistical tests performed in a–c). An explorative comparison of the n = 110 cases analyzed by all seven models with the n = 15 newly added cases shows that the performance scores align well (one-sided unpaired Mann–Whitney test, alternative = greater; GPT-4o: P = 0.1460, W = 955, 95% CI −5.38 × 10⁻⁵ to infinity, estimate 3.16 × 10⁻⁵; DeepSeek-R1: P = 0.5256, W = 817.5, 95% CI −1.46 × 10⁻⁵ to infinity, estimate −1.73 × 10⁻⁵; Gem2FTE: P = 0.4591, W = 838.5, 95% CI −9.54 × 10⁻⁶ to infinity, estimate −6.10 × 10⁻⁵). e, Cumulative frequency of the Likert scores for GPT-4o, DeepSeek-R1, Gem2FTE and GPT-4.
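The explorative comparison of the n = 110 original cases with the n = 15 newly added cases is an unpaired one-sided Mann–Whitney U test. A minimal sketch of such a comparison, again assuming SciPy and placeholder scores rather than the study's ratings:

    # Sketch only: unpaired, one-sided Mann-Whitney U test comparing the scores
    # of the 110 original cases with the 15 newly added cases, as in the caption.
    # The score arrays are hypothetical placeholders, not the study data.
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(1)
    scores_original = rng.integers(1, 6, size=110)  # Likert scores, original cases
    scores_new = rng.integers(1, 6, size=15)        # Likert scores, new cases

    u_stat, p = mannwhitneyu(scores_original, scores_new,
                             alternative="greater",  # one-sided, as in the caption
                             use_continuity=True)    # continuity correction (default)
    print(f"W = {u_stat}, P = {p:.4f}")  # a large P suggests comparable scoring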
Extended Data Fig. 1. Visual abstract.
Extended Data Fig. 2. Summarized model performances for diagnosis and treatment recommendation tasks.
Histograms showing the performance of GPT-4o, DeepSeek-R1, Gemini-2.0 Flash Thinking Experimental (Gem2FTE) and DeepSeek-V3 on the diagnosis and treatment recommendation tasks, rated with Likert scores. Five points represent the highest possible level of accuracy as assessed by the expert. The red line indicates the mean performance of each model.
