Nat Med. 2025 Aug;31(8):2546-2549. doi: 10.1038/s41591-025-03727-2. Epub 2025 Apr 23.

Benchmark evaluation of DeepSeek large language models in clinical decision-making

Sarah Sandmann et al. Nat Med. 2025 Aug.

Abstract

Large language models (LLMs) are increasingly transforming medical applications. However, proprietary models such as GPT-4o face significant barriers to clinical adoption because they cannot be deployed on site within healthcare institutions, making them noncompliant with stringent privacy regulations. Recent advances in open-source LLMs such as the DeepSeek models offer a promising alternative because they allow efficient fine-tuning on local data in hospitals with advanced information technology infrastructure. Here, to demonstrate the clinical utility of DeepSeek-V3 and DeepSeek-R1, we benchmarked their performance on clinical decision support tasks against proprietary LLMs, including GPT-4o and Gemini-2.0 Flash Thinking Experimental. Using 125 patient cases with sufficient statistical power, covering a broad range of frequent and rare diseases, we found that the DeepSeek models perform as well as, and in some cases better than, the proprietary LLMs. Our study demonstrates that open-source LLMs can provide a scalable pathway for secure model training, enabling real-world medical applications in accordance with data privacy and healthcare regulations.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Model performance for diagnosis tasks.
a–d, Bubble plots showing the results of the 125 pairwise comparisons on a 5-point Likert scale for GPT-4o versus DeepSeek-R1 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 0.3085, V = 378, 95% CI −3.13 × 10⁻⁷ to infinity, estimate 0.25) (a); GPT-4o versus Gemini-2.0 Flash Thinking Experimental (Gem2FTE) (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 7.89 × 10⁻⁶, V = 1,576, 95% CI 0.5 to infinity, estimate 0.75) (b); DeepSeek-R1 versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 5.73 × 10⁻⁵, V = 1,515, 95% CI 0.5 to infinity, estimate 0.5) (c); and DeepSeek-R1 versus DeepSeek-V3 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 1, V = 307, 95% CI −0.25 to infinity, estimate 1.97 × 10⁻⁵) (d). e, Violin plots comparing the Likert scores of GPT-4o, DeepSeek-R1, DeepSeek-V3 and Gem2FTE with those of GPT-4, GPT-3.5 and Google in our previous study (n.s., not significant; ***P < 0.001; significance levels visualize the results of the statistical tests performed in a–d). An explorative comparison of the n = 110 cases analyzed by all seven models with the n = 15 newly added cases shows that the performance scores align well (one-sided unpaired Mann–Whitney test, alternative = greater; GPT-4o: P = 0.5441, W = 813.5, 95% CI −1.84 × 10⁻⁵ to infinity, estimate −4.99 × 10⁻⁵; DeepSeek-R1: P = 0.7710, W = 740, 95% CI 3.75 × 10⁻⁵ to infinity, estimate −2.16 × 10⁻⁵; DeepSeek-V3: P = 0.6678, W = 775.5, 95% CI −7.45 × 10⁻⁵ to infinity, estimate 5.91 × 10⁻⁵; Gem2FTE: P = 0.9899, W = 540, 95% CI −0.5 to infinity, estimate −3.51 × 10⁻⁵). f, Cumulative frequency of the Likert scores for GPT-4o, DeepSeek-R1, DeepSeek-V3, Gem2FTE and GPT-4.
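The V and W statistics reported above appear to follow the conventions of R's wilcox.test (V for the paired test, W for the unpaired test). As a minimal sketch of the same kind of analysis, assuming SciPy as the tooling and using hypothetical placeholder scores rather than the study's data, a one-sided paired Mann–Whitney (Wilcoxon signed-rank) test with continuity correction and a Bonferroni adjustment over k = 4 comparisons could be run as follows:

    # Sketch only: paired, one-sided Wilcoxon signed-rank test with continuity
    # correction and Bonferroni adjustment, as described in the Fig. 1 caption.
    # The Likert-score arrays are hypothetical placeholders, not the study data.
    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    scores_model_a = rng.integers(1, 6, size=125)  # 5-point Likert scores, model A
    scores_model_b = rng.integers(1, 6, size=125)  # 5-point Likert scores, model B

    k = 4  # number of pairwise diagnosis comparisons (Fig. 1a-d)
    stat, p = wilcoxon(scores_model_a, scores_model_b,
                       alternative="greater",   # one-sided: model A rated higher
                       correction=True,         # continuity correction
                       zero_method="wilcox")    # drop tied (zero-difference) cases
    p_adjusted = min(1.0, p * k)                # Bonferroni correction
    print(f"statistic = {stat}, adjusted P = {p_adjusted:.4f}")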
Fig. 2. Model performance for treatment recommendation tasks.
a–c, Bubble plots showing the results of the 125 pairwise comparisons on a 5-point Likert scale for GPT-4o versus DeepSeek-R1 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.1522, V = 771.5, 95% CI −6.88 × 10⁻⁵ to infinity, estimate 0.25) (a); GPT-4o versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.0016, V = 1,154, 95% CI 0.2501 to infinity, estimate 0.5) (b); and DeepSeek-R1 versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.0235, V = 1,124, 95% CI 4.21 × 10⁻⁶ to infinity, estimate 0.5) (c). d, Violin plots comparing the Likert scores of GPT-4o, DeepSeek-R1 and Gem2FTE with those of GPT-4 and GPT-3.5 (n.s., not significant; *P < 0.05; significance levels visualize the results of the statistical tests performed in a–c). An explorative comparison of the n = 110 cases analyzed by all seven models with the n = 15 newly added cases shows that the performance scores align well (one-sided unpaired Mann–Whitney test, alternative = greater; GPT-4o: P = 0.1460, W = 955, 95% CI −5.38 × 10⁻⁵ to infinity, estimate 3.16 × 10⁻⁵; DeepSeek-R1: P = 0.5256, W = 817.5, 95% CI −1.46 × 10⁻⁵ to infinity, estimate −1.73 × 10⁻⁵; Gem2FTE: P = 0.4591, W = 838.5, 95% CI −9.54 × 10⁻⁶ to infinity, estimate −6.10 × 10⁻⁵). e, Cumulative frequency of the Likert scores for GPT-4o, DeepSeek-R1, Gem2FTE and GPT-4.
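The explorative comparison of the n = 110 original cases with the n = 15 newly added cases is an unpaired one-sided Mann–Whitney U test. A minimal sketch of such a comparison, again assuming SciPy and placeholder scores rather than the study's ratings:

    # Sketch only: unpaired, one-sided Mann-Whitney U test comparing the scores
    # of the 110 original cases with the 15 newly added cases, as in the caption.
    # The score arrays are hypothetical placeholders, not the study data.
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(1)
    scores_original = rng.integers(1, 6, size=110)  # Likert scores, original cases
    scores_new = rng.integers(1, 6, size=15)        # Likert scores, new cases

    u_stat, p = mannwhitneyu(scores_original, scores_new,
                             alternative="greater",  # one-sided, as in the caption
                             use_continuity=True)    # continuity correction (default)
    print(f"W = {u_stat}, P = {p:.4f}")  # a large P suggests comparable scoring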
Extended Data Fig. 1. Visual abstract.
Extended Data Fig. 2. Summarized model performances for diagnosis and treatment recommendation tasks.
Histograms showing the performance of GPT-4o, DeepSeek-R1, Gemini-2.0 Flash Thinking Experimental (Gem2FTE) and DeepSeek-V3 on the diagnosis and treatment recommendation tasks, rated with Likert scores. Five points represent the highest possible level of accuracy as assessed by the expert. The red line indicates the mean performance of each model.
