DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
- PMID: 40962978
- PMCID: PMC12443585
- DOI: 10.1038/s41586-025-09422-z
Abstract
General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs) [1,2] and chain-of-thought (CoT) prompting [3], have achieved considerable success on foundational reasoning tasks. However, this success depends heavily on extensive human-annotated demonstrations, and model capabilities remain insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions and STEM problems, surpassing counterparts trained through conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically used to guide and enhance the reasoning capabilities of smaller models.
© 2025. The Author(s).
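
The training signal described in the abstract is verifiable: on mathematics and coding tasks, correctness can be checked by rule rather than judged by a learned reward model. Below is a minimal, hypothetical Python sketch of that idea, pairing a rule-based accuracy reward with the group-relative advantage normalization used by GRPO, the policy-optimization algorithm reported for DeepSeek-R1. The \boxed{} answer convention, the function names and the group size are illustrative assumptions, not the authors' code.

```python
import re
import statistics

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Rule-based verifiable reward: 1.0 if the completion's final
    \boxed{...} answer matches the reference, else 0.0.
    (The \boxed{} convention is an assumption for illustration.)"""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match and match.group(1).strip() == reference_answer.strip():
        return 1.0
    return 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: normalize each sampled completion's
    reward by the mean and standard deviation of its sampling group,
    the core idea of GRPO."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example: a group of 4 completions sampled for one prompt.
completions = [
    r"... so the answer is \boxed{42}",
    r"... therefore \boxed{41}",
    r"... hence \boxed{42}",
    r"no boxed answer given",
]
rewards = [accuracy_reward(c, "42") for c in completions]
print(grpo_advantages(rewards))  # positive for correct answers, negative otherwise
```

The design point mirrored here is that the reward needs no human-labelled reasoning trajectory: only the final answer is checked, leaving the chain of thought free to evolve under RL.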
Conflict of interest statement
Competing interests: The authors declare no competing interests and will not file patents related to the content of this manuscript.
References
1. Brown, T. B. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33 (eds Larochelle, H. et al.) (ACM, 2020).
2. OpenAI et al. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2024).
3. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35 (eds Koyejo, S. et al.) 24824–24837 (ACM, 2022).
4. Wei, J. et al. Emergent abilities of large language models. Transactions on Machine Learning Research (eds Kamath, G. et al.) (2022).
5. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).
