What is dopamine doing in model-based reinforcement learning?

Thomas Akam et al. Curr Opin Behav Sci. 2021 Apr;38:74-82. doi: 10.1016/j.cobeha.2020.10.010.

Abstract

Experiments have implicated dopamine in model-based reinforcement learning (RL). These findings are unexpected, as dopamine is thought to encode a reward prediction error (RPE), the key teaching signal in model-free RL. Here we examine two possible accounts of dopamine's involvement in model-based RL: the first, that dopamine neurons carry a prediction error used to update a type of predictive state representation called the successor representation; the second, that two well-established aspects of dopaminergic activity, RPEs and surprise signals, can together explain dopamine's involvement in model-based RL.


Figures

Figure 1. Neuron numbers, signal dimension, and algorithm.
A) Number of cortico-striatal, striatal and midbrain dopamine neurons in the rat brain; estimates from 43,44,48. The disparity in neuron numbers strongly suggests that the signal carried by dopamine is much lower dimensional than the cortical input to striatum.
B) Diagram illustrating temporal difference (TD) value learning using linear function approximation. A vector s_t of state features active at time t is multiplied element-wise by a weight vector θ_t to give a vector whose elements are the contributions made by each state feature to the scalar value v_t of the state. The scalar reward prediction error δ_t, used to update the weights, is computed as δ_t = r_t + γv_t − v_{t−1}, where r_t is the immediate reward at time t, γ is a discount rate and v_{t−1} the value at the previous time step. Irrespective of the dimension of the state representation, the prediction error is a scalar; hence, if s_t is represented by cortical neurons, θ_t by their synapses in striatum, and δ_t by dopamine, the algorithm is consistent with a much smaller signal dimension for dopamine relative to cortical input.
C) Temporal difference learning of a feature-based successor representation (SR). The state vector s_t is multiplied by a weight matrix Θ_t to give a matrix M_t whose elements m_t(i,j) are the contributions made by feature i of the current state to predictions about the future occurrence of feature j. Summing the contributions of all features of the current state gives the vector m_t, the SR for the current state. As the SR is a prediction of future state features, rather than of rewards as in TD value learning, the feature vector s_t takes the role played by the reward r_t in value learning. As the prediction is a vector whose dimension is given by the number of state features, so is the prediction error δ_t. This algorithm does not appear consistent with the massive difference in signal dimension between cortical and dopaminergic input to striatum.
D) The one-dimensional prediction error signal in standard TD value learning is inconsistent with the observed heterogeneity of dopaminergic responses. One possible explanation is that parallel cortico-basal-ganglia loops (labelled 1 & 2) independently learn value estimates, each using only a subset of state features. For clarity we have shown the extreme case of no crosstalk between loops.
E) Recent data suggest that rather than predicting scalar reward, the basal ganglia predict multiple axes of reinforcement (reward and threat) in loops involving different striatal regions (nucleus accumbens and tail of striatum). Here we have shown two loops which use partially overlapping sets of state features to predict different components of a multi-dimensional reinforcement vector r_t.
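To make the update rules described in panels B and C concrete, the following is a minimal Python sketch of both algorithms. It illustrates the generic TD equations from the legend, not code from the paper; the feature dimension, learning rate, discount factor and example feature vectors are arbitrary assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 4   # dimension of the state feature vector s_t (assumed for illustration)
alpha = 0.1      # learning rate (assumed)
gamma = 0.9      # discount rate (assumed)

# --- Panel B: TD value learning with linear function approximation ---
theta = np.zeros(n_features)   # weight vector; state value v_t = theta . s_t

def td_value_update(s_prev, s_curr, r_curr, theta):
    """One TD update. The prediction error is a scalar,
    regardless of the dimension of the state representation."""
    v_prev = theta @ s_prev
    v_curr = theta @ s_curr
    delta = r_curr + gamma * v_curr - v_prev   # delta_t = r_t + gamma*v_t - v_{t-1}
    theta += alpha * delta * s_prev            # credit the features whose prediction the error corrects
    return delta

# --- Panel C: TD learning of a feature-based successor representation ---
Theta = np.zeros((n_features, n_features))    # weight matrix; SR vector m_t = Theta.T @ s_t

def td_sr_update(s_prev, s_curr, Theta):
    """One SR-TD update. The state features s_t play the role of the reward,
    so the prediction error is a vector with one element per feature."""
    m_prev = Theta.T @ s_prev
    m_curr = Theta.T @ s_curr
    delta_vec = s_curr + gamma * m_curr - m_prev     # vector-valued prediction error
    Theta += alpha * np.outer(s_prev, delta_vec)     # update all feature-to-feature weights
    return delta_vec

# Purely illustrative usage on random feature vectors.
s_prev = rng.random(n_features)
s_curr = rng.random(n_features)
print("scalar RPE:", td_value_update(s_prev, s_curr, r_curr=1.0, theta=theta))
print("vector SR prediction error:", td_sr_update(s_prev, s_curr, Theta))
```

The contrast the figure draws is visible directly in the return types: the value-learning update yields a single scalar error, whereas the SR update yields an error vector whose dimension matches the number of state features.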
Figure 2. How do internal predictive models contribute to RPE signals?
Diagrams showing different ways in which predictive internal models could contribute value information to dopaminergic RPEs.
A) Model-based planning uses roll-outs along different possible future trajectories (black) from the current state (red) to calculate the long-run value associated with different options. The short latency of dopamine responses likely precludes repeated roll-outs using new information from occurring in time to inform the RPE, though offline planning during rest or sleep may update cached values that inform future RPEs.
B) The successor representation caches a diffuse prediction of likely future states given the current state, which averages over previously experienced behavioural trajectories. This allows new long-run values to be computed rapidly when the immediate rewards associated with states change (but see 38 for some limitations).
C) A minimal roll-out of only the most probable future trajectory could potentially provide useful value information rapidly in near-deterministic settings.
D) In addition to predicting the future, internal models may help to disambiguate between possible current states that can only be distinguished by considering the extended history. If the different states have different cached values associated with them, such state inference will affect RPEs.
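As a concrete illustration of panel B, the sketch below shows how a cached successor matrix lets long-run values be recomputed immediately when the rewards attached to states change, without relearning the predictions themselves. The small deterministic environment, discount factor and reward vectors are invented for the example; the only substantive fact used is the standard closed form of the SR for a fixed policy with transition matrix T, namely M = (I − γT)⁻¹.

```python
import numpy as np

n_states = 3
gamma = 0.9

# Transition matrix of a toy deterministic chain 0 -> 1 -> 2 -> 0 (assumed for illustration).
T = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

# Cached successor representation: M[s, s'] is the expected discounted
# future occupancy of state s' when starting from state s.
M = np.linalg.inv(np.eye(n_states) - gamma * T)

# Long-run values follow immediately from the cached SR and the current reward vector.
r_old = np.array([0.0, 0.0, 1.0])
r_new = np.array([1.0, 0.0, 0.0])   # rewards change, e.g. after a revaluation

V_old = M @ r_old
V_new = M @ r_new                   # no new learning of M is required

print("values under old rewards:", V_old)
print("values under new rewards:", V_new)
```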

References

    1. Schultz W, Dayan P, Montague PR. A Neural Substrate of Prediction and Reward. Science. 1997;275:1593–1599.
    2. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. The MIT Press; 1998.
    3. Lerner TN, et al. Intact-Brain Analyses Reveal Distinct Information Carried by SNc Dopamine Subcircuits. Cell. 2015;162:635–647.
    4. Parker NF, et al. Reward and choice encoding in terminals of midbrain dopamine neurons depends on striatal target. Nat Neurosci. 2016;19:845–854.
    5. Menegas W, Babayan BM, Uchida N, Watabe-Uchida M. Opposite initialization to novel cues in dopamine signaling in ventral and posterior striatum in mice. eLife. 2017;6:e21886.
