Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning
- PMID: 40267969
- DOI: 10.1038/s41591-025-03726-3
Abstract
DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. Here we assessed the capabilities of three LLMs (DeepSeek-R1, ChatGPT-o1 and Llama 3.1-405B) in performing four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE); interpreting and reasoning on the basis of text-based diagnostic and management cases; providing tumor classification according to RECIST 1.1 criteria; and providing summaries of diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1 (accuracy 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy 0.95; P = 0.04) but better than that of Llama 3.1-405B (accuracy 0.83; P < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy 0.57 versus 0.55; P = 0.76, and 0.74 versus 0.76; P = 0.06, using the New England Journal of Medicine and Médicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.74 versus 0.81; P = 0.10). Diagnostic reasoning steps provided by DeepSeek-R1 were deemed more accurate than those provided by ChatGPT-o1 and Llama 3.1-405B (average Likert scores of 3.61, 3.22 and 3.13, respectively; P = 0.005 and P < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 versus 4.8; P < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement.
© 2025. The Author(s), under exclusive licence to Springer Nature America, Inc.
Conflict of interest statement
Competing interests: T.D. is the managing partner of RadImageNet LLC and a paid consultant to GEHC and AirsMedical. X.M. is a paid consultant to RadImageNet LLC. The other authors declare no competing interests.