Nature. 2025 Dec;648(8093):312-319.
doi: 10.1038/s41586-025-09761-x. Epub 2025 Oct 22.

Discovering state-of-the-art reinforcement learning algorithms


Junhyuk Oh et al. Nature. 2025 Dec.

Abstract

Humans and other animals use powerful reinforcement learning (RL) mechanisms that have been discovered by evolution over many generations of trial and error. By contrast, artificial agents typically learn using handcrafted learning rules. Despite decades of interest, the goal of autonomously discovering powerful RL algorithms has proven to be elusive1-6. Here we show that it is possible for machines to discover a state-of-the-art RL rule that outperforms manually designed rules. This was achieved by meta-learning from the cumulative experiences of a population of agents across a large number of complex environments. Specifically, our method discovers the RL rule by which the agent's policy and predictions are updated. In our large-scale experiments, the discovered rule surpassed all existing rules on the well-established Atari benchmark and outperformed a number of state-of-the-art RL algorithms on challenging benchmarks that it had not seen during discovery. Our findings suggest that the RL algorithms required for advanced artificial intelligence may soon be automatically discovered from the experiences of agents, rather than manually designed.

Conflict of interest statement

Competing interests: One or more patent applications directed to aspects of the work described have been filed and are pending as of the date of manuscript submission. Google LLC has ownership and potential commercial interests in the work described.

Figures

Fig. 1
Fig. 1. Discovering an RL rule from a population of agents.
a, Discovery. Multiple agents, interacting with various environments, are trained in parallel according to the learning rule defined by the meta-network. Meanwhile, the meta-network is optimized to improve the agents’ collective performance. b, Agent architecture. An agent produces the following outputs: (1) a policy (π), (2) an observation-conditioned prediction vector (y), (3) action-conditioned prediction vectors (z), (4) action values (q) and (5) an auxiliary policy prediction (p). The semantics of y and z are determined by the meta-network. c, Meta-network architecture. A trajectory of the agent’s outputs is given as input to the meta-network, together with rewards and episode-termination indicators from the environment (omitted from the figure for simplicity). Using this information, the meta-network produces targets for all of the agent’s predictions from the current and future time steps. The agent is updated to minimize the prediction errors with respect to these targets. LSTM, long short-term memory. d, Meta-optimization. The meta-parameters of the meta-network are updated by taking a meta-gradient step calculated by backpropagation through the agent’s update process (θ0 → θN), where the meta-objective is to maximize the collective returns of the agents in their environments.
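The meta-gradient step in d can be made concrete with a toy example. Below is a minimal sketch, not the authors' implementation: it assumes a linear agent, a one-step bandit environment, a single inner update and made-up parameter shapes, and it only mirrors the overall structure of the figure (the meta-network produces targets, the agent steps towards them, and the meta-parameters are updated by differentiating the updated agent's return through that step).

    import jax
    import jax.numpy as jnp

    N_ACTIONS, OBS_DIM, PRED_DIM, INNER_LR = 4, 6, 8, 0.1

    key = jax.random.PRNGKey(0)
    k1, k2, k3, k4 = jax.random.split(key, 4)
    theta = {"w_pi": 0.1 * jax.random.normal(k1, (OBS_DIM, N_ACTIONS)),
             "w_y":  0.1 * jax.random.normal(k2, (OBS_DIM, PRED_DIM))}
    feat_dim = N_ACTIONS + PRED_DIM + 1  # policy logits + prediction y + reward
    phi = {"w_pi_t": 0.1 * jax.random.normal(k3, (feat_dim, N_ACTIONS)),
           "w_y_t":  0.1 * jax.random.normal(k4, (feat_dim, PRED_DIM))}

    obs = jnp.ones(OBS_DIM)                           # toy single observation
    action_rewards = jnp.array([0.0, 1.0, 0.0, 0.0])  # toy one-step bandit rewards

    def agent_outputs(theta, obs):
        # Toy linear agent: policy logits and an observation-conditioned prediction y.
        return obs @ theta["w_pi"], obs @ theta["w_y"]

    def update_targets(phi, logits, y, reward):
        # Toy "meta-network": maps the agent's outputs and the reward to update targets.
        feats = jnp.concatenate([logits, y, jnp.array([reward])])
        return jax.nn.softmax(feats @ phi["w_pi_t"]), feats @ phi["w_y_t"]

    def inner_loss(theta, phi, obs, reward):
        # The agent minimizes its prediction errors against the meta-produced targets.
        logits, y = agent_outputs(theta, obs)
        pi_target, y_target = update_targets(phi, logits, y, reward)
        log_pi = jax.nn.log_softmax(logits)
        return -(pi_target * log_pi).sum() + ((y - y_target) ** 2).mean()

    def inner_update(theta, phi, obs, reward):
        # One agent update step (theta_0 -> theta_1), driven by the meta-network.
        grads = jax.grad(inner_loss)(theta, phi, obs, reward)
        return jax.tree_util.tree_map(lambda p, g: p - INNER_LR * g, theta, grads)

    def meta_objective(phi, theta, obs):
        # Negated expected return of the *updated* agent in the toy bandit.
        logits, _ = agent_outputs(theta, obs)
        reward = (jax.nn.softmax(logits) * action_rewards).sum()
        new_theta = inner_update(theta, phi, obs, reward)
        new_logits, _ = agent_outputs(new_theta, obs)
        return -(jax.nn.softmax(new_logits) * action_rewards).sum()

    # Meta-gradient step: backpropagate through the agent's update into phi.
    meta_grads = jax.grad(meta_objective)(phi, theta, obs)
    phi = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, phi, meta_grads)

In the full method, this inner update is repeated over many steps (θ0 → θN) and across a population of agents and environments before each meta-gradient is applied.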
Fig. 2
Fig. 2. Evaluation of DiscoRL.
a–f, Performance of DiscoRL compared to human-designed RL rules on Atari (a), ProcGen (b), DMLab (c), Crafter (d; figure inset shows results for 1 million environment steps), NetHack (e), and Sokoban (f). The x axis represents the number of environment steps in millions. The y axis represents the human-normalized IQM score for benchmarks consisting of multiple tasks (Atari, ProcGen and DMLab-30) and average return for the rest. Disco57 (blue) is discovered from the Atari benchmark and Disco103 (orange) is discovered from Atari, ProcGen and DMLab-30 benchmarks. The shaded areas show 95% confidence intervals. The dashed lines represent manually designed RL rules such as MuZero, efficient memory-based exploration agent (MEME), Dreamer, self-tuning actor-critic algorithm (STACX), importance-weighted actor-learner architecture (IMPALA), deep Q-network (DQN), phasic policy gradient (PPG), proximal policy optimization (PPO), and Rainbow.
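For reference, the human-normalized IQM score used for the multi-task benchmarks can be sketched as follows. This is a hedged illustration of the standard metric, not code from the paper; the per-game scores below are made up, and the convention of averaging the middle 50% of normalized scores is the usual definition of the interquartile mean.

    import numpy as np

    def human_normalized(agent, random_score, human_score):
        # Human-normalized score: 0 at random play, 1 at the human reference.
        return (agent - random_score) / (human_score - random_score)

    def iqm(values):
        # Interquartile mean: the mean of the middle 50% of the sorted values.
        v = np.sort(np.asarray(values, dtype=float))
        lo, hi = int(np.floor(0.25 * len(v))), int(np.ceil(0.75 * len(v)))
        return v[lo:hi].mean()

    # Made-up per-game agent/random/human scores, purely for illustration.
    agent = np.array([1200.0, 35.0, 8.1, 540.0, 90.0, 300.0, 15.0, 4000.0])
    rand  = np.array([150.0, 2.0, 0.5, 20.0, 5.0, 10.0, 1.0, 200.0])
    human = np.array([6900.0, 30.0, 7.0, 4300.0, 60.0, 250.0, 12.0, 9000.0])
    print(iqm(human_normalized(agent, rand, human)))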
Fig. 3
Fig. 3. Properties of discovery process.
a, Discovery efficiency. The best DiscoRL rule was discovered within three simulated agent lifetimes (200 million steps) per game. b, Scalability. DiscoRL becomes stronger on the ProcGen benchmark (30 million environment steps for all methods) as the training set of environments grows. c, Ablation. The plot shows the performance of variants of DiscoRL on Atari. ‘Without auxiliary prediction’ is meta-learned without the auxiliary prediction (p). ‘Small agents’ uses a smaller agent network during discovery. ‘Without prediction’ is meta-learned without the learned predictions (y, z). ‘Without value’ is meta-learned without the value function (q). ‘Toy environments’ is meta-learned from 57 grid-world tasks instead of Atari games.
Fig. 4
Fig. 4. Analysis of DiscoRL.
a, Behaviour of discovered predictions. The plot shows how the agent’s discovered prediction (y) changes along with other quantities in Ms Pacman (left) and Breakout (right). ‘Confidence’ is calculated as the negative entropy. Spikes in prediction confidence are correlated with upcoming salient events; for example, they often precede large rewards in Ms Pacman and strong action preferences in Breakout. b, Gradient analysis. Each contour shows where each prediction focuses in the observation, based on a gradient analysis in Beam Rider. The predictions tend to focus on distant enemies, whereas the policy and the value tend to focus on nearby enemies and the scoreboard, respectively. c, Prediction analysis. Future entropy and large-reward events can be better predicted from the discovered predictions. The shaded areas represent 95% confidence intervals. d, Bootstrapping horizon. The plot shows how much the prediction target produced by DiscoRL changes when the prediction at each time step is perturbed. The thin curves correspond to 16 randomly sampled trajectories and the bold curve to their average. e, Reliance on predictions. The plot shows the performance of the controlled DiscoRL on Ms Pacman when predictions are updated without bootstrapping and when predictions are not used at all. The shaded areas represent 95% confidence intervals.
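As a small illustration of the ‘confidence’ measure in a, the snippet below computes the negative entropy of a prediction vector after softmax normalization; treating y as logits to be normalized this way is an assumption made for the example, not a detail taken from the paper.

    import numpy as np

    def confidence(logits):
        # Negative entropy of the softmax-normalized vector: higher means more peaked.
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return float(np.sum(p * np.log(p + 1e-12)))

    print(confidence(np.array([0.1, 0.1, 0.1, 0.1])))  # near-uniform: low confidence
    print(confidence(np.array([8.0, 0.1, 0.1, 0.1])))  # peaked: high confidence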
Extended Data Fig. 1
Extended Data Fig. 1. Robustness of DiscoRL.
The plots show the performance of Disco57 and Muesli on Ms Pacman by varying agent settings. ‘Discovery’ and ‘Evaluation’ represent the setting used for discovery and evaluation, respectively. (a) Each rule was evaluated on various agent network sizes. (b) Each rule was evaluated on various replay ratios, which define the proportion of replay data in a batch compared to on-policy data. (c) A sweep over optimisers (Adam or RMSProp), learning rates, weight decays, and gradient clipping thresholds was evaluated (36 combinations in total) and ranked according to the final score.
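To make the replay ratio in (b) concrete, here is a toy sketch of how a training batch might mix replayed and on-policy data at a given ratio; the buffer, batch size and sampling scheme are illustrative assumptions rather than details from the paper.

    import random

    def mixed_batch(on_policy, replay_buffer, batch_size=32, replay_ratio=0.75):
        # replay_ratio is the fraction of the batch drawn from the replay buffer;
        # the remainder comes from the most recent on-policy transitions.
        n_replay = int(round(replay_ratio * batch_size))
        n_online = batch_size - n_replay
        batch = random.sample(replay_buffer, n_replay) + list(on_policy[-n_online:])
        random.shuffle(batch)
        return batch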
Extended Data Fig. 2
Extended Data Fig. 2. Detailed results for the regression and classification analysis.
Each cell represents the test score of one MLP model that has been trained to predict some quantity (columns) given the agent’s outputs (rows).
Extended Data Fig. 3
Extended Data Fig. 3. Effect of meta-network architecture.
(a) The x-axis represents the number of environment steps in evaluation and the y-axis the IQM score on the Atari benchmark; all methods are discovered from 16 randomly selected Atari games and the shaded areas show 95% confidence intervals (this also applies to b). The meta-RNN component slightly improves performance. (b) Each curve corresponds to a different meta-network architecture, varying the number of LSTM hidden units or replacing the LSTM component with a transformer. The choice of meta-network architecture minimally affects performance.
Extended Data Fig. 4
Extended Data Fig. 4. Computational cost comparison.
The x-axis represents the number of TPU-hours spent on evaluation. The y-axis represents the performance on the Atari benchmark. Each algorithm was evaluated on 57 Atari games for 200 million environment steps. DiscoRL reached MuZero’s final performance with approximately 40% less computation.

References

    1. Kirsch, L., van Steenkiste, S. & Schmidhuber, J. Improving generalization in meta reinforcement learning using learned objectives. In Proc. International Conference on Learning Representations (ICLR, 2020).
    2. Kirsch, L. et al. Introducing symmetries to black box meta reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence 36, 7202–7210 (Association for the Advancement of Artificial Intelligence, 2022).
    3. Oh, J. et al. Discovering reinforcement learning algorithms. In Proc. Adv. Neural Inf. Process. Syst. 33, 1060–1070 (NeurIPS, 2020).
    4. Xu, Z. et al. Meta-gradient reinforcement learning with an objective discovered online. In Proc. Adv. Neural Inf. Process. Syst. 33, 15254–15264 (NeurIPS, 2020).
    5. Houthooft, R. et al. Evolved policy gradients. In Proc. Adv. Neural Inf. Process. Syst. 31, 5405–5414 (NeurIPS, 2018).
