Dopamine neurons learn to encode the long-term value of multiple future rewards

Kazuki Enomoto et al. Proc Natl Acad Sci U S A. 2011 Sep 13;108(37):15462-15467. doi: 10.1073/pnas.1014457108. Epub 2011 Sep 6.

Abstract

Midbrain dopamine neurons signal reward value, their prediction error, and the salience of events. If they play a critical role in achieving specific distant goals, long-term future rewards should also be encoded as suggested in reinforcement learning theories. Here, we address this experimentally untested issue. We recorded 185 dopamine neurons in three monkeys that performed a multistep choice task in which they explored a reward target among alternatives and then exploited that knowledge to receive one or two additional rewards by choosing the same target in a set of subsequent trials. An analysis of anticipatory licking for reward water indicated that the monkeys did not anticipate an immediately expected reward in individual trials; rather, they anticipated the sum of immediate and multiple future rewards. In accordance with this behavioral observation, the dopamine responses to the start cues and reinforcer beeps reflected the expected values of the multiple future rewards and their errors, respectively. More specifically, when monkeys learned the multistep choice task over the course of several weeks, the responses of dopamine neurons encoded the sum of the immediate and expected multiple future rewards. The dopamine responses were quantitatively predicted by theoretical descriptions of the value function with time discounting in reinforcement learning. These findings demonstrate that dopamine neurons learn to encode the long-term value of multiple future rewards with distant rewards discounted.
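
In the reinforcement-learning formulation referred to in the abstract, the "long-term value" fitted to the behavioral and neural data is the discounted sum of expected future rewards, and the response to each reinforcer corresponds to its prediction error. A generic statement of these quantities (the symbols below are illustrative, not the paper's own notation):

V_t = \mathbb{E}\!\left[\sum_{k \ge 0} \gamma^{k}\, r_{t+k}\right], \qquad 0 \le \gamma \le 1,

\delta_t = r_t + \gamma\, V_{t+1} - V_t,

where γ is the discount factor fitted in the analyses of Figs. 2–4 and δ_t is the temporal-difference error thought to be reflected in the responses to the reinforcer beeps; smaller γ means steeper discounting of the more distant rewards.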


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Behavioral paradigms of multistep actions for rewards in monkeys. (A) Sequence of events during the multistep choice task. (B) Schematically illustrated structure of the three-step choice trials to obtain three rewards at different times. (C) Average correct choice rates (mean and SD, 29 d in monkey SK and 35 d in monkey CC, during the advanced stage of learning) against five types of three-step choice trials (N1, N2, N3, R1, and R2) in two monkeys.
Fig. 2.
Reward expectation during multistep actions measured by anticipatory licking. (A) The anticipatory licking movements for the 800-ms period before the reinforcer beeps (SI Text) in monkey CC are color-coded. (B) The average proportion of trials in which the amplitude of anticipatory licking exceeded the threshold (50% maximum) is plotted against the time to the reinforcer beeps in the two monkeys. (C) Bar graphs of the normalized licking duration (100–800 ms period before the beeps, mean and SEM; 32 sessions in monkey BT and 75 sessions in monkey CC; SI Text) against trial type. The average reward probability (dashed green line) and the best-fit value function derived from a reinforcement learning algorithm (solid black line, γ = 0.65, R = 0.71, P = 0.29 in monkey BT; γ = 0.66, R = 0.74, P = 0.16 in monkey CC) are superimposed. (D) The parameter space landscape of correlation coefficients between the experimental and simulated licking duration, in which R is plotted against γ (the fitting procedure is sketched after the figure legends). The values of the second derivatives of R are −27 for monkey BT and −6.1 for monkey CC.
Fig. 3.
Dopamine neurons encode long-term value as a sum of expected future rewards. (A) Example responses of a dopamine neuron to the illumination of the start cues in individual trials of the three-step choice task in monkey CC. The bin size of the spike density histogram is 15 ms. Hatched areas are the time windows for the analyses shown in B. (B) Bar graphs of the ensemble average of dopamine responses (mean and SEM) above the baseline in monkey CC during the time windows (50–290 ms after the start cue) shown in A. The best-fit value function (γ = 0.65, R = 0.71, P = 0.18, solid line) and the reward probability of trials (γ = 0.00, R = 0.29, P = 0.68, dashed line) are superimposed. The numbers in parentheses represent the reward probability for the given trial type. (C) Same as in B but for monkey SK (40–240 ms after the start cue). The best-fit value function (γ = 0.31, R = 0.99, P < 0.01) is superimposed (Fig. S2B). (D) Same as in B but for monkey SK (70–260 ms after the start cue) in the two-step choice task with a fixed amount of reward. The best-fit value function is superimposed (γ = 0.68, R = 0.71, P = 0.29). (E) Plots of the parameter space landscape of correlation coefficients. The value of the second derivative of R is −5.8 for monkey CC, −2.2 for monkey SK in the three-step choice task, and −36 for monkey SK in the two-step choice task.
Fig. 4.
Development of value coding by dopamine neurons through learning. (A) The adaptive increase in the correct choice rate in N3 trials through the learning of the three-step choice task over 51 to 61 d. The advanced stage of learning (correct choice rate > 0.8) is indicated by shading. (B) Bar graphs of the average duration of anticipatory licking on day 10 (early stage) and day 37 (advanced stage) of learning in monkey CC (mean and SEM, solid arrows in A). The best-fit value functions in the early stage (γ = 0.05, R = 0.90, P < 0.05) and in the advanced stage (γ = 0.73, R = 0.69, P = 0.20) are superimposed. (C) Bar graphs of start cue responses of an example neuron recorded on day 12 (early stage) and of another neuron on day 29 (advanced stage) in monkey SK (dashed arrows in A). Superimposed line plots are the best-fit value functions (γ = 0.04, R = 0.83, P = 0.08, day 12; γ = 0.38, R = 0.91, P < 0.05, day 29). (D) Plots of the parameter space landscape of correlation coefficients of the data in C. The values of the second derivatives of R are −1.7 during the early stage and −3.0 during the advanced stage. (E) Bar graphs of the ensemble average responses of 25 dopamine neurons (mean and SEM). Superimposed plots show the best-fit value function (γ = 0.00, R = 0.72, P = 0.18). Ensemble average responses during the advanced stage are shown in Fig. 3 B and C.
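
The discount factor γ, correlation R, p-values, and second derivatives of R reported in Figs. 2–4 come from comparing each measured quantity (licking duration or dopamine response per trial type) with a simulated value function while varying γ. A minimal sketch of how such a grid-search fit could be implemented is shown below; the function names, the γ grid, and the use of a numerical second derivative are my own assumptions, not the authors' analysis code.

import numpy as np
from scipy.stats import pearsonr

def value_function(expected_rewards, gamma):
    # Discounted sum of expected future rewards at the start cue:
    # V = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    return sum(gamma ** k * r for k, r in enumerate(expected_rewards))

def fit_gamma(observed, rewards_by_trial_type, gammas=np.linspace(0.0, 1.0, 101)):
    # observed: measured value per trial type (e.g. normalized licking
    #   duration or dopamine response for N1, N2, N3, R1, R2).
    # rewards_by_trial_type: for each trial type, the expected reward at the
    #   current step and at each remaining step of the trial sequence.
    # Returns (best gamma, R, p-value, second derivative of R at the peak);
    # the last value indexes how sharply the data constrain gamma.
    fits = []
    for g in gammas:
        simulated = [value_function(rs, g) for rs in rewards_by_trial_type]
        fits.append(pearsonr(observed, simulated))
    r_curve = np.array([r for r, _ in fits])
    r_curve = np.where(np.isnan(r_curve), -np.inf, r_curve)  # guard degenerate fits
    best = int(np.argmax(r_curve))
    curvature = np.gradient(np.gradient(r_curve, gammas), gammas)[best]
    return gammas[best], r_curve[best], fits[best][1], curvature

In this scheme, γ = 0 corresponds to tracking only the immediately expected reward, whereas γ near 1 corresponds to summing all remaining rewards in the trial sequence with little discounting, which is the distinction drawn in Figs. 2–4.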

