Review

Distributional Reinforcement Learning in the Brain

Adam S Lowet et al. Trends Neurosci. 2020 Dec;43(12):980-997.
doi: 10.1016/j.tins.2020.09.004. Epub 2020 Oct 19.

Abstract

Learning about rewards and punishments is critical for survival. Classical studies have demonstrated an impressive correspondence between the firing of dopamine neurons in the mammalian midbrain and the reward prediction errors of reinforcement learning algorithms, which express the difference between actual reward and predicted mean reward. However, it may be advantageous to learn not only the mean but also the complete distribution of potential rewards. Recent advances in machine learning have revealed a biologically plausible set of algorithms for reconstructing this reward distribution from experience. Here, we review the mathematical foundations of these algorithms as well as initial evidence for their neurobiological implementation. We conclude by highlighting outstanding questions regarding the circuit computation and behavioral readout of these distributional codes.
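
As a concrete point of reference for the mean-based account, here is a minimal Python sketch of a Rescorla-Wagner-style update in which a single value estimate tracks the mean reward via the reward prediction error; the reward distribution and learning rate are toy values chosen purely for illustration:

    import numpy as np

    # Toy mean-tracking value update (illustrative values, not from the review).
    rng = np.random.default_rng(0)
    rewards = rng.choice([1.0, 5.0], size=10_000, p=[0.7, 0.3])

    V, alpha = 0.0, 0.05          # value estimate and learning rate
    for r in rewards:
        rpe = r - V               # RPE: actual reward minus predicted mean reward
        V += alpha * rpe
    print(V)                      # approaches the true mean, 0.7*1 + 0.3*5 = 2.2

Distributional RL, reviewed below, replaces this single estimate with a family of estimators that together recover the full reward distribution.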

Keywords: artificial intelligence; deep neural networks; dopamine; machine learning; population coding; reward.

Figures

Figure 1. Deep reinforcement learning
(a) A formulation of reinforcement learning problems. In reinforcement learning, an agent learns which action to take in a given state in order to maximize the cumulative sum of future rewards. In video games such as Atari games (here Space Invaders is shown), the agent chooses which action (a(t); joystick movement, button press) to take based on the current state (s(t); pixel images). The reward (r(t)) is defined as the points that the agent or player earns. After David Silver's lecture slides (https://www.davidsilver.uk/teaching/). (b) Structure of the deep Q-network (DQN). A deep artificial neural network (more specifically, a convolutional neural network) takes as input a high-dimensional state vector (pixel images of 4 consecutive Atari game frames) along with sparse scalar rewards, and returns as output a vector corresponding to the value of each possible action given that state (called action values or Q-values and denoted Q(s, a)). The agent chooses actions based on these Q-values. To improve performance, the original DQN implemented a technique called “experience replay,” whereby sequences of experienced events are stored in a memory buffer and replayed in random order during training [2]. This helped remove correlations in the observation sequence, which had previously prevented RL algorithms from being used to train neural networks. Modified after [2]. (c) Difference between traditional and distributional reinforcement learning. A distributional DQN estimates a complete reward distribution for each allowable action, rather than a single expected value. Modified after [6]. (d) Performance of different RL algorithms within the DQN architecture. Gray, DQN using a traditional RL algorithm [2]. Light blue, DQN using a categorical distributional RL algorithm (the C51 algorithm [6]). Blue, DQN using distributional RL based on quantile regression [7]. Modified after [7].
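
The Q-learning update and experience replay described in (b) can be sketched on a tabular toy problem in Python; the state and action counts, learning rate, and discount factor below are illustrative assumptions, not the parameters of the original DQN, and no convolutional network is involved:

    import random
    import numpy as np

    # Tabular toy sketch of the DQN-style update (illustrative values).
    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))   # Q(s, a), the action values
    alpha, gamma = 0.1, 0.99              # learning rate and discount factor
    replay_buffer = []                    # stores (s, a, r, s_next) transitions

    def store(transition):
        replay_buffer.append(transition)

    def replay_step():
        # "Experience replay": revisit a past transition in random order,
        # which breaks correlations in the observation sequence.
        s, a, r, s_next = random.choice(replay_buffer)
        td_target = r + gamma * Q[s_next].max()
        td_error = td_target - Q[s, a]    # scalar (mean) reward prediction error
        Q[s, a] += alpha * td_error

In the actual DQN, Q is a convolutional network trained by gradient descent on this same TD error, and minibatches rather than single transitions are sampled from the buffer.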
Figure 2. Learning rules of distributional RL (quantile and expectile regression)
(a) The standard Rescorla-Wagner learning rule converges to the mean of the reward distribution. (b) Modifying the update rule to use only the sign of the prediction error causes the associated value predictor to converge to the median of the reward distribution. (c-d) Adding diversity to the learning rates, alongside a binarized update rule that follows the sign of the prediction error, causes a family of value predictors to converge to quantiles of the reward distribution. More precisely, the value qτi to which predictor i converges is the τi-th quantile of the distribution, where τi = αi+ / (αi+ + αi−). This is illustrated for both unimodal (c) and bimodal (d) distributions. (e) The cumulative distribution function (CDF) is a familiar representation of a probability distribution. (f) By transposing this representation, we get the quantile function, or inverse CDF (left). Uniformly spaced quantiles cluster in regions of higher probability density (right). Together, these quantiles encode the reward distribution in a non-parametric fashion. (g-h) Multiplying the prediction error by asymmetric learning rates yields expectiles. Relative to quantiles, expectiles are pulled toward the mean, for both unimodal (g) and bimodal (h) distributions.
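
The quantile and expectile learning rules in (c-d) and (g-h) can be sketched in a few lines of Python; the bimodal reward distribution and the learning rates below are illustrative assumptions:

    import numpy as np

    # Asymmetric updates with toy values (illustrative, not from the review).
    rng = np.random.default_rng(0)
    rewards = rng.choice([1.0, 5.0], size=50_000, p=[0.7, 0.3])  # bimodal example

    alpha_plus, alpha_minus = 0.03, 0.01
    tau = alpha_plus / (alpha_plus + alpha_minus)    # asymmetry tau = 0.75

    V_quantile, V_expectile = 0.0, 0.0
    for r in rewards:
        delta = r - V_quantile
        # Sign-only update with asymmetric rates -> tau-th quantile.
        V_quantile += (alpha_plus if delta > 0 else alpha_minus) * np.sign(delta)
        delta = r - V_expectile
        # Error-scaled update with asymmetric rates -> tau-th expectile.
        V_expectile += (alpha_plus if delta > 0 else alpha_minus) * delta

    # V_quantile hovers near the 0.75-quantile (5.0); V_expectile settles near
    # the 0.75-expectile (3.25), i.e., pulled toward the mean (2.2).

Running many such predictors in parallel, each with its own pair of learning rates, tiles τ across (0, 1) and yields a non-parametric code for the reward distribution.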
Figure 3. Distributional RL as minimizing a loss function
(a) The reward probabilities of an example reward distribution. The mean (Vmean), median (Vmedian), 0.25-quantile (V0.25-quantile), and 0.97-expectile (V0.97-expectile) of this distribution are indicated with different colors. (b-e) Loss as a function of the value estimate V (left) when the rewards follow the distribution presented in (a), illustrating that V = Vmean minimizes the mean squared error (b), V = Vmedian minimizes the mean absolute error (c), V = V0.25-quantile minimizes the quantile regression loss for τ = 0.25 (d), and V = V0.97-expectile minimizes the expectile regression loss for τ = 0.97 (e). The right panels show the loss as a function of the RPE δ.
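
The four losses in (b-e) can be written compactly; the Python sketch below uses an illustrative sampled reward distribution and a grid search simply to check the minimizers numerically:

    import numpy as np

    def squared_loss(V, r):          # minimized by the mean
        return np.mean((r - V) ** 2)

    def absolute_loss(V, r):         # minimized by the median
        return np.mean(np.abs(r - V))

    def quantile_loss(V, r, tau):    # minimized by the tau-th quantile
        delta = r - V
        return np.mean(np.where(delta > 0, tau, 1 - tau) * np.abs(delta))

    def expectile_loss(V, r, tau):   # minimized by the tau-th expectile
        delta = r - V
        return np.mean(np.where(delta > 0, tau, 1 - tau) * delta ** 2)

    r = np.random.default_rng(0).normal(3.0, 1.0, size=10_000)  # toy rewards
    grid = np.linspace(0.0, 6.0, 601)
    best = grid[np.argmin([quantile_loss(V, r, 0.25) for V in grid])]
    print(best, np.quantile(r, 0.25))  # the two values agree closely

The quantile loss weights the absolute error asymmetrically by τ versus 1 − τ, while the expectile loss applies the same asymmetric weights to the squared error, which is why its minimizer is pulled toward the mean.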
Figure 4. The structured diversity of midbrain dopamine neurons is consistent with distributional RL
(a) Schematic of five different response functions (spiking activity of dopamine neurons) to positive and negative RPEs. In this model, the slopes of the response function for positive and negative RPEs correspond to the learning rates α+ and α−. Diversity in α+ and α− values results in different asymmetric scaling factors τi = αi+ / (αi+ + αi−). (b) RPE channels (δi) with α+ < α− overweight negative prediction errors, resulting in pessimistic (blue) value predictors (Vi), while RPE channels with α+ > α− overweight positive prediction errors, resulting in optimistic (red) value predictors. This representation corresponds to the Rescorla-Wagner approach in which RPE and value pairs form separate channels, with no crosstalk between channels with different scaling factors. See Box 2 for the general update rule when this condition is not met. (c) Given that different value predictors encode different reward magnitudes, the corresponding RPE channels will have diverse reversal points (reward magnitudes that elicit no RPE activity relative to baseline). The reversal points correspond to the values Vi of the τi-th expectiles of the reward distribution. (d) Reversal points are consistent across two different halves of the data, suggesting that the observed diversity is reliable (P = 1.8 × 10⁻⁵; each point represents a cell). Modified after [8]. (e) Diversity in asymmetric scaling in dopamine neurons tiles the entire [0, 1] interval and is statistically reliable (one-way ANOVA; F(38,234) = 2.93, P = 4 × 10⁻⁷). Modified after [8]. (f) Significant correlation between reversal points and asymmetric scaling in dopamine neurons (each point is a cell; linear regression, P = 8.1 × 10⁻⁵). Grey traces show variability over simulations of the distributional TD algorithm run to calculate reversal points in this task. Modified after [8]. (g) Decoding of the reward distribution from dopamine cell activity using an expectile code. The expectiles of the distribution, {(τi, eτi)}, were defined by the asymmetries and reversal points of dopamine neurons. The grey area represents the smoothed reward distribution, light blue traces represent several decoding runs, and the dark blue trace shows their mean. Modified after [8].
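
The decoding step in panel (g) can be sketched as an imputation problem: find a set of reward samples whose τi-th expectiles match the reversal points read out from the neurons. The Python sketch below (requiring NumPy and SciPy) uses hypothetical (τi, eτi) values and is a simplification of the procedure used in [8]:

    import numpy as np
    from scipy.optimize import minimize

    taus = np.array([0.2, 0.4, 0.6, 0.8])    # hypothetical asymmetric scaling factors
    e_obs = np.array([1.5, 2.5, 3.5, 5.0])   # hypothetical reversal points (expectiles)

    def expectile_condition(samples, V, tau):
        # Stationarity condition for the tau-th expectile: zero when V is
        # the tau-th expectile of the samples.
        delta = samples - V
        return np.mean(np.where(delta > 0, tau, 1 - tau) * delta)

    def mismatch(samples):
        # Squared violation of all observed (tau, expectile) pairs.
        return sum(expectile_condition(samples, V, tau) ** 2
                   for tau, V in zip(taus, e_obs))

    result = minimize(mismatch, x0=np.linspace(1.0, 5.0, 20), method="L-BFGS-B")
    decoded = np.sort(result.x)   # imputed samples approximating the distribution

Each term of the objective vanishes exactly when the imputed samples reproduce the corresponding (τi, eτi) pair, so the minimizer is a sample-based decode of the reward distribution implied by the neural code.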

References

    1. LeCun Y et al. (2015) Deep learning. Nature 521, 436–444
    2. Mnih V et al. (2015) Human-level control through deep reinforcement learning. Nature 518, 529–533
    3. Silver D et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489
    4. Botvinick M et al. (2019) Reinforcement learning, fast and slow. Trends Cogn. Sci. 23, 408–422
    5. Hassabis D et al. (2017) Neuroscience-inspired artificial intelligence. Neuron 95, 245–258
