Review

Distributional Reinforcement Learning in the Brain

Adam S Lowet et al. Trends Neurosci. 2020 Dec;43(12):980-997.
doi: 10.1016/j.tins.2020.09.004. Epub 2020 Oct 19.

Abstract

Learning about rewards and punishments is critical for survival. Classical studies have demonstrated an impressive correspondence between the firing of dopamine neurons in the mammalian midbrain and the reward prediction errors of reinforcement learning algorithms, which express the difference between actual reward and predicted mean reward. However, it may be advantageous to learn not only the mean but also the complete distribution of potential rewards. Recent advances in machine learning have revealed a biologically plausible set of algorithms for reconstructing this reward distribution from experience. Here, we review the mathematical foundations of these algorithms as well as initial evidence for their neurobiological implementation. We conclude by highlighting outstanding questions regarding the circuit computation and behavioral readout of these distributional codes.
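
As a concrete point of reference for the mean-based account, here is a minimal Python sketch of a Rescorla-Wagner-style update in which a single value estimate tracks the mean reward via the reward prediction error; the reward distribution and learning rate are toy values chosen purely for illustration:

    import numpy as np

    # Toy mean-tracking value update (illustrative values, not from the review).
    rng = np.random.default_rng(0)
    rewards = rng.choice([1.0, 5.0], size=10_000, p=[0.7, 0.3])

    V, alpha = 0.0, 0.05          # value estimate and learning rate
    for r in rewards:
        rpe = r - V               # RPE: actual reward minus predicted mean reward
        V += alpha * rpe
    print(V)                      # approaches the true mean, 0.7*1 + 0.3*5 = 2.2

Distributional RL, reviewed below, replaces this single estimate with a family of estimators that together recover the full reward distribution.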

Keywords: artificial intelligence; deep neural networks; dopamine; machine learning; population coding; reward.

Figures

Figure 1. Deep reinforcement learning
(a) A formulation of reinforcement learning problems. In reinforcement learning, an agent learns which action to take in a given state in order to maximize the cumulative sum of future rewards. In video games such as Atari games (here Space Invaders is shown), the agent chooses which action (a(t); joystick movement, button press) to take based on the current state (s(t); pixel images). The reward (r(t)) is defined as the points that the agent or player earns. After David Silver's lecture slides (https://www.davidsilver.uk/teaching/). (b) Structure of the deep Q-network (DQN). A deep artificial neural network (more specifically, a convolutional neural network) takes as input a high-dimensional state vector (pixel images of 4 consecutive Atari game frames) along with sparse scalar rewards, and returns as output a vector corresponding to the value of each possible action given that state (called action values or Q-values and denoted Q(s, a)). The agent chooses actions based on these Q-values. To improve performance, the original DQN implemented a technique called “experience replay,” whereby sequences of experienced events are stored in a memory buffer and replayed in random order during training [2]. This helped remove correlations in the observation sequence, which had previously prevented RL algorithms from being used to train neural networks. Modified after [2]. (c) Difference between traditional and distributional reinforcement learning. A distributional DQN estimates a complete reward distribution for each allowable action, rather than a single expected value. Modified after [6]. (d) Performance of different RL algorithms within the DQN architecture. Gray, DQN using a traditional RL algorithm [2]. Light blue, DQN using a categorical distributional RL algorithm (the C51 algorithm [6]). Blue, DQN using distributional RL based on quantile regression [7]. Modified after [7].
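
The Q-learning update and experience replay described in (b) can be sketched on a tabular toy problem in Python; the state and action counts, learning rate, and discount factor below are illustrative assumptions, not the parameters of the original DQN, and no convolutional network is involved:

    import random
    import numpy as np

    # Tabular toy sketch of the DQN-style update (illustrative values).
    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))   # Q(s, a), the action values
    alpha, gamma = 0.1, 0.99              # learning rate and discount factor
    replay_buffer = []                    # stores (s, a, r, s_next) transitions

    def store(transition):
        replay_buffer.append(transition)

    def replay_step():
        # "Experience replay": revisit a past transition in random order,
        # which breaks correlations in the observation sequence.
        s, a, r, s_next = random.choice(replay_buffer)
        td_target = r + gamma * Q[s_next].max()
        td_error = td_target - Q[s, a]    # scalar (mean) reward prediction error
        Q[s, a] += alpha * td_error

In the actual DQN, Q is a convolutional network trained by gradient descent on this same TD error, and minibatches rather than single transitions are sampled from the buffer.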
Figure 2. Learning rules of distributional RL (quantile and expectile regression)
(a) The standard Rescorla-Wagner learning rule converges to the mean of the reward distribution. (b) Modifying the update rule to use only the sign of the prediction error causes the associated value predictor to converge to the median of the reward distribution. (c-d) Adding diversity to the learning rates, alongside a binarized update rule that follows the sign of the prediction error, causes a family of value predictors to converge to quantiles of the reward distribution. More precisely, the value qτi to which predictor i converges is the τi-th quantile of the distribution, where τi = αi+ / (αi+ + αi−). This is illustrated for both unimodal (c) and bimodal (d) distributions. (e) The cumulative distribution function (CDF) is a familiar representation of a probability distribution. (f) By transposing this representation, we get the quantile function, or inverse CDF (left). Uniformly spaced quantiles cluster in regions of higher probability density (right). Together, these quantiles encode the reward distribution in a non-parametric fashion. (g-h) Multiplying the prediction error by asymmetric learning rates yields expectiles. Relative to quantiles, expectiles are pulled toward the mean, for both unimodal (g) and bimodal (h) distributions.
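
The quantile and expectile learning rules in (c-d) and (g-h) can be sketched in a few lines of Python; the bimodal reward distribution and the learning rates below are illustrative assumptions:

    import numpy as np

    # Asymmetric updates with toy values (illustrative, not from the review).
    rng = np.random.default_rng(0)
    rewards = rng.choice([1.0, 5.0], size=50_000, p=[0.7, 0.3])  # bimodal example

    alpha_plus, alpha_minus = 0.03, 0.01
    tau = alpha_plus / (alpha_plus + alpha_minus)    # asymmetry tau = 0.75

    V_quantile, V_expectile = 0.0, 0.0
    for r in rewards:
        delta = r - V_quantile
        # Sign-only update with asymmetric rates -> tau-th quantile.
        V_quantile += (alpha_plus if delta > 0 else alpha_minus) * np.sign(delta)
        delta = r - V_expectile
        # Error-scaled update with asymmetric rates -> tau-th expectile.
        V_expectile += (alpha_plus if delta > 0 else alpha_minus) * delta

    # V_quantile hovers near the 0.75-quantile (5.0); V_expectile settles near
    # the 0.75-expectile (3.25), i.e., pulled toward the mean (2.2).

Running many such predictors in parallel, each with its own pair of learning rates, tiles τ across (0, 1) and yields a non-parametric code for the reward distribution.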
Figure 3. Distributional RL as minimizing a loss function
(a) The reward probabilities of an example reward distribution. The mean (Vmean), median (Vmedian), 0.25-quantile (V0.25-quantile), and 0.97-expectile (V0.97-expectile) of this distribution are indicated with different colors. (b-e) Loss as a function of the value estimate V (left) when the rewards follow the distribution presented in (a), illustrating that V = Vmean minimizes the mean squared error (b), V = Vmedian minimizes the mean absolute error (c), V = V0.25-quantile minimizes the quantile regression loss for τ = 0.25 (d), and V = V0.97-expectile minimizes the expectile regression loss for τ = 0.97 (e). The right panels show the loss as a function of the RPE δ.
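
The four losses in (b-e) can be written compactly; the Python sketch below uses an illustrative sampled reward distribution and a grid search simply to check the minimizers numerically:

    import numpy as np

    def squared_loss(V, r):          # minimized by the mean
        return np.mean((r - V) ** 2)

    def absolute_loss(V, r):         # minimized by the median
        return np.mean(np.abs(r - V))

    def quantile_loss(V, r, tau):    # minimized by the tau-th quantile
        delta = r - V
        return np.mean(np.where(delta > 0, tau, 1 - tau) * np.abs(delta))

    def expectile_loss(V, r, tau):   # minimized by the tau-th expectile
        delta = r - V
        return np.mean(np.where(delta > 0, tau, 1 - tau) * delta ** 2)

    r = np.random.default_rng(0).normal(3.0, 1.0, size=10_000)  # toy rewards
    grid = np.linspace(0.0, 6.0, 601)
    best = grid[np.argmin([quantile_loss(V, r, 0.25) for V in grid])]
    print(best, np.quantile(r, 0.25))  # the two values agree closely

The quantile loss weights the absolute error asymmetrically by τ versus 1 − τ, while the expectile loss applies the same asymmetric weights to the squared error, which is why its minimizer is pulled toward the mean.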
Figure 4. The structured diversity of midbrain dopamine neurons is consistent with distributional RL
(a) Schematic of five different response functions (spiking activity of dopamine neurons) to positive and negative RPEs. In this model, the slopes of the response function for positive and negative RPEs correspond to the learning rates α+ and α−. Diversity in α+ and α− values results in different asymmetric scaling factors τi = αi+ / (αi+ + αi−). (b) RPE channels (δi) with α+ < α− overweight negative prediction errors, resulting in pessimistic (blue) value predictors (Vi), while RPE channels with α+ > α− overweight positive prediction errors, resulting in optimistic (red) value predictors. This representation corresponds to the Rescorla-Wagner approach in which RPE and value pairs form separate channels, with no crosstalk between channels with different scaling factors. See Box 2 for the general update rule when this condition is not met. (c) Given that different value predictors encode different reward magnitudes, the corresponding RPE channels will have diverse reversal points (reward magnitudes that elicit no RPE activity relative to baseline). The reversal points correspond to the values Vi of the τi-th expectiles of the reward distribution. (d) Reversal points are consistent across two different halves of the data, suggesting that the observed diversity is reliable (P = 1.8 × 10⁻⁵; each point represents a cell). Modified after [8]. (e) Diversity in asymmetric scaling in dopamine neurons tiles the entire [0, 1] interval and is statistically reliable (one-way ANOVA; F(38,234) = 2.93, P = 4 × 10⁻⁷). Modified after [8]. (f) Significant correlation between reversal points and asymmetric scaling in dopamine neurons (each point is a cell; linear regression, P = 8.1 × 10⁻⁵). Grey traces show variability over simulations of the distributional TD algorithm run to calculate reversal points in this task. Modified after [8]. (g) Decoding of the reward distribution from dopamine cell activity using an expectile code. The expectiles of the distribution, {(τi, eτi)}, were defined by the asymmetries and reversal points of dopamine neurons. The grey area represents the smoothed reward distribution, light blue traces represent several decoding runs, and the dark blue trace shows their mean. Modified after [8].
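
The decoding step in panel (g) can be sketched as an imputation problem: find a set of reward samples whose τi-th expectiles match the reversal points read out from the neurons. The Python sketch below (requiring NumPy and SciPy) uses hypothetical (τi, eτi) values and is a simplification of the procedure used in [8]:

    import numpy as np
    from scipy.optimize import minimize

    taus = np.array([0.2, 0.4, 0.6, 0.8])    # hypothetical asymmetric scaling factors
    e_obs = np.array([1.5, 2.5, 3.5, 5.0])   # hypothetical reversal points (expectiles)

    def expectile_condition(samples, V, tau):
        # Stationarity condition for the tau-th expectile: zero when V is
        # the tau-th expectile of the samples.
        delta = samples - V
        return np.mean(np.where(delta > 0, tau, 1 - tau) * delta)

    def mismatch(samples):
        # Squared violation of all observed (tau, expectile) pairs.
        return sum(expectile_condition(samples, V, tau) ** 2
                   for tau, V in zip(taus, e_obs))

    result = minimize(mismatch, x0=np.linspace(1.0, 5.0, 20), method="L-BFGS-B")
    decoded = np.sort(result.x)   # imputed samples approximating the distribution

Each term of the objective vanishes exactly when the imputed samples reproduce the corresponding (τi, eτi) pair, so the minimizer is a sample-based decode of the reward distribution implied by the neural code.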

References

    1. LeCun Y et al. (2015) Deep learning. Nature 521, 436–444
    2. Mnih V et al. (2015) Human-level control through deep reinforcement learning. Nature 518, 529–533
    3. Silver D et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489
    4. Botvinick M et al. (2019) Reinforcement learning, fast and slow. Trends Cogn. Sci. 23, 408–422
    5. Hassabis D et al. (2017) Neuroscience-inspired artificial intelligence. Neuron 95, 245–258
