Science. 2022 Dec 23;378(6626):eabq6740. doi: 10.1126/science.abq6740. Epub 2022 Dec 23.

Mesolimbic dopamine release conveys causal associations



Huijeong Jeong et al. Science. 2022.

Abstract

Learning to predict rewards based on environmental cues is essential for survival. It is believed that animals learn to predict rewards by updating predictions whenever the outcome deviates from expectations, and that such reward prediction errors (RPEs) are signaled by the mesolimbic dopamine system, a key controller of learning. However, instead of learning prospective predictions from RPEs, animals can infer predictions by learning the retrospective cause of rewards. Hence, whether mesolimbic dopamine instead conveys a causal associative signal that sometimes resembles RPE remains unknown. We developed an algorithm for retrospective causal learning and found that mesolimbic dopamine release conveys causal associations but not RPE, thereby challenging the dominant theory of reward learning. Our results reshape the conceptual and biological framework for associative learning.


Figures

Fig. 1. An algorithm for uncovering causal associations in an environment.
A. Animals can learn cue-reward associations either prospectively (“does reward follow cue?”) or retrospectively (“does cue precede reward?”). B. The dominant model for cue-reward learning is temporal difference reinforcement learning (TDRL), which learns the prospective association between a cue and reward, i.e., a measure of how often the reward follows the cue (cue value). To this end, the algorithm looks forward from a cue to predict upcoming rewards. When this prediction is incorrect, the original prediction is updated using a reward prediction error (RPE). The simplest of this family of models is the Rescorla-Wagner model, which does not consider the delay between cue and reward. TDRL algorithms extend this simple model to account for the cue-reward delay by modeling it as a series of states that measure time elapsed since stimulus onset. Two such examples are shown. C. Here, we propose an algorithm that retrospectively learns the causes of meaningful stimuli such as rewards (figs. S1 to S4). Because causes precede outcomes, causal learning requires only a memory trace of the past. In our mechanistic model, a memory trace of prior stimuli is maintained using an exponentially decaying eligibility trace for each stimulus (78), which allows online calculation of the experienced rate of that stimulus (79). We hypothesized that mesolimbic dopamine activity signals ANCCR, a quantity that measures whether an experienced stimulus is a meaningful causal target.
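To make the two schemes summarized here concrete, below is a minimal sketch in Python, assuming per-trial and per-event updates with illustrative parameters (alpha, tau) that are not taken from the paper: a Rescorla-Wagner value update driven by a prediction error (panel B), and an exponentially decaying eligibility trace supporting an online estimate of a stimulus rate, as in the retrospective model of panel C.

    import numpy as np

    # Illustrative sketch, not the authors' code.

    def rescorla_wagner(cue_present, reward, value, alpha=0.1):
        """One trial of a Rescorla-Wagner update: the cue value moves toward the
        obtained reward by a fraction alpha of the prediction error."""
        prediction = value if cue_present else 0.0
        delta = reward - prediction          # prediction error (no within-trial timing)
        if cue_present:
            value += alpha * delta
        return value, delta

    def update_eligibility_trace(trace, dt, stimulus_occurred, tau=20.0):
        """Exponentially decaying eligibility trace of a stimulus: the trace decays
        with time constant tau (seconds) and increments by 1 at each stimulus
        occurrence, so it tracks the recently experienced rate of the stimulus."""
        trace *= np.exp(-dt / tau)           # decay over the elapsed interval dt
        if stimulus_occurred:
            trace += 1.0
        return trace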
Fig. 2. The retrospective causal algorithm produces a signal similar to temporal difference reward prediction error (RPE) in simulations of previous experiments.
A. During simple conditioning of a cue-reward association, ANCCR appears qualitatively similar to an RPE signal: it is low before and high after learning for the cue, high before and low after learning for the reward, and negative after omission of an expected reward. All error bars are standard error of the mean throughout the manuscript. B. For probabilistic rewards, ANCCR produces qualitatively similar responses to RPE for cue, reward, and omission. Note that in B, animals were never trained on a fully predicted reward; slight differences in omission responses between A and B result from this difference. C. For trial-by-trial changes in reward magnitude, ANCCR produces reward responses similar to positive and negative RPEs (similar to (80)). D-F. Simulations of ANCCR learning produce behavior consistent with conditioned inhibition (D), blocking (E), and overexpectation (F). G. Simulated inhibition of dopamine at reward time in cue-reward conditioning produces extinction of learned behavior (similar to (55)). H. Simulation of dopamine inhibition at reward time produces trial-by-trial changes in behavior (similar to (81)). I. Simulation of unblocking due to dopamine activation at reward during blocking (similar to (14)).
Fig. 3. The dynamics of dopamine responses to unpredicted rewards are consistent with ANCCR, but not TDRL RPE.
A. For the first two tests, we gave experimentally naïve mice random unpredictable sucrose rewards immediately following head-fixation while recording sub-second dopamine release in NAcc using the optical dopamine sensor dLight 1.3b (Methods). Animals underwent multiple sessions with 100 rewards each (n=8 mice). B. Theoretical predictions for both models. Test 1: As a naïve animal receives unpredicted rewards, the RPE model predicts high responses because the rewards are unpredicted. Nevertheless, because the inter-reward interval (IRI) states acquire value with repeated experience, the RPE at reward will decrease over repeated rewards. On the other hand, ANCCR predicts low reward responses early, because an experimentally naïve animal has no prior expectation/eligibility trace of sucrose early in the task; the response subsequently approaches a signal that is ~1 times the incentive value of sucrose. Test 2: The reward response following a short IRI will be larger in the RPE model because the reward was received earlier than expected, resulting in a negative correlation between dopamine reward response and the previous IRI. However, because ANCCR has a subtractive term proportional to the baseline reward rate (Mr←- in the figure), and the baseline reward rate decreases with longer IRIs, ANCCR predicts a positive correlation between dopamine reward response and the previous IRI. C. Simulations confirming the intuitive reasoning from B for Test 1. CSC and MS stand for complete serial compound and microstimulus, respectively (one-sample t test against a null of zero; t(99) = RPE (CSC), −65.74; RPE (MS), −27.57; ANCCR, 18.60; two-tailed p values = RPE (CSC), 1.7×10⁻⁸³; RPE (MS), 3.0×10⁻⁴⁸; ANCCR, 4.5×10⁻³⁴; n=100 simulations). D. Licking and dopamine responses from two example mice (rewards with less than 3 s previous IRI were excluded to avoid confounding by ongoing licking responses). Though not our initial prediction, ANCCR can even account for the negative unpredicted sucrose response from Animal 2 (fig. S8). E. Quantification of the correlation between dopamine response and number of rewards. The left panel shows data from an example animal and the right panel shows the population summary across all animals (one-sample t test against a null of zero; t(7) = 4.40, two-tailed p = 0.0031; n=8 animals). Reward response was defined as the difference in area under the curve (AUC) of the fluorescence trace between the reward and baseline periods (Methods). F. Simulations confirming the intuitive reasoning from B for Test 2 (one-sample t test against a null of zero; t(99) = RPE (CSC), −1.7×10³; RPE (MS), −151.28; ANCCR, 335.03; two-tailed p values = RPE (CSC), 5.0×10⁻²²³; RPE (MS), 6.3×10⁻¹¹⁹; ANCCR, 4.8×10⁻¹⁵³; n=100 iterations). G. Quantification of the correlation between dopamine response and the previous IRI for an example session (left) and the population of all animals (one-sample t test against a null of zero; t(7) = 5.95, two-tailed p = 5.7×10⁻⁴, n=8 animals). The average correlation across all sessions for each animal is plotted in the bar graph.
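As a rough illustration of the quantification described in panels E and G, the following sketch (assumed variable names and window lengths; not the authors' analysis code) computes a reward response as the AUC difference between a post-reward window and a pre-reward baseline, correlates responses with the preceding IRI within a session, and tests per-animal correlations against zero with a one-sample t test.

    import numpy as np
    from scipy import stats

    def reward_response(trace, t, reward_time, win=1.0):
        """Approximate AUC of the fluorescence trace in [reward_time, reward_time + win]
        minus the AUC in the preceding baseline window of equal length
        (uniform sampling of t is assumed)."""
        dt = t[1] - t[0]
        post = (t >= reward_time) & (t < reward_time + win)
        base = (t >= reward_time - win) & (t < reward_time)
        return (np.sum(trace[post]) - np.sum(trace[base])) * dt

    def iri_correlation(responses, previous_iris):
        """Pearson correlation between reward responses and the preceding IRIs (one session)."""
        r, _ = stats.pearsonr(previous_iris, responses)
        return r

    # Population-level test: one (average) correlation per animal, tested against zero.
    per_animal_r = np.array([0.2, 0.3, 0.1, 0.25, 0.3, 0.15, 0.2, 0.3])  # placeholder values, not data
    t_stat, p_two_tailed = stats.ttest_1samp(per_animal_r, popmean=0.0)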
Fig. 4. The dynamics of dopamine responses during cue-reward learning are consistent with ANCCR, but not TDRL RPE.
A. TDRL predicts that dopaminergic and behavioral learning will be tightly linked during learning. In contrast, the causal learning model proposes that there is no one-to-one relationship between behavioral and dopaminergic learning. B. Schematic of a cue-reward learning task in which one auditory tone predicted reward (labeled CS+) and another had no predicted outcome (labeled CS−). C. Licking and dopamine measurements from an example animal showing that the dopamine response to CS+ significantly precedes the emergence of anticipatory licking (days 4 vs 12, respectively, shown by the arrows). D. Schematic of a cumulative sum (cumsum) plot of artificial time-series data. A time series that increases over trials appears below the diagonal in the cumsum plot with an increasing slope over trials, and one that decreases over trials appears above the diagonal. Further, a sudden change in the time series appears as a sudden change in slope in the cumsum plot. E, F. The dopamine cue response considerably leads behavior across animals. Each line is one animal, with the blue line corresponding to the example from C. Behavioral learning is much more abrupt than dopaminergic learning (paired t test for abruptness of change; t(6) = 9.06; two-tailed p = 1.0×10⁻⁴; paired t test for change trial; t(6) = −2.93; two-tailed p = 0.0263; n=7 animals). G. Anticipatory licking and dopamine release in an example animal after increasing the cue duration from 2 s to 8 s while maintaining a 1 s trace interval and a long ITI (~33 s). Trials are shown in chronological order from bottom to top. The three vertical dashed lines indicate cue onset, cue offset, and reward delivery (also in J and O). H-I. Behavior is learned abruptly by all animals, but the dopaminergic cue response shows little to no change. The dashed vertical line is the trial at which the experimental condition transitions (in H, K, and P). We tested for the lack of change by showing that the Akaike information criterion (AIC) is similar between a model assuming a change and a model assuming no change (paired t test for abruptness of change; t(6) = 22.92; two-tailed p = 4.52×10⁻⁷; one-sample t test for ΔAIC against a null of zero; t(6) = 7.49 for lick, −0.86 for dopamine; two-tailed p = 2.9×10⁻⁴ for lick, 0.4244 for dopamine; n=7 animals). J. The dopaminergic cue response of an example animal remains positive well after it learns extinction of the cue-reward association. K-L. Across all animals, the dopaminergic cue response remains significantly positive despite abrupt behavioral learning of extinction (paired t test for abruptness of change; t(6) = 5.67; two-tailed p = 0.0013; paired t test for change trial; t(6) = −2.40; two-tailed p = 0.0531; n=7 animals). M. Experiment to reduce the retrospective association while maintaining the prospective association. N. Two experiments that show a specific reduction in either the prospective or the retrospective association. O. Licking and dopamine release from an example animal. P. The dopamine cue response decreases more rapidly during the background reward experiment, in which the cue is consistently followed by a reward, than during extinction, in which there is no reward (paired t test; t(6) = −3.51; two-tailed p = 0.0126; n=7 animals).
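The cumsum visualization in D and the change-versus-no-change comparison in H-I could be implemented along the following lines; this is a sketch under assumed Gaussian residuals and a single-step change-point model, not the authors' code.

    import numpy as np

    def normalized_cumsum(x):
        """Cumulative sum of a per-trial signal, scaled to end at 1 so it can be
        plotted against the diagonal; an increasing signal falls below the diagonal."""
        c = np.cumsum(x - np.min(x))
        return c / c[-1] if c[-1] != 0 else c

    def gaussian_aic(residuals, n_params):
        """AIC of a Gaussian model with the given residuals (up to an additive constant)."""
        n = len(residuals)
        return n * np.log(np.sum(residuals ** 2) / n) + 2 * n_params

    def delta_aic(x):
        """AIC(no-change model) minus AIC(best single step-change model);
        values near zero indicate no detectable change across trials."""
        flat = gaussian_aic(x - np.mean(x), n_params=1)
        best_step = np.inf
        for k in range(1, len(x)):  # candidate change trials
            resid = np.concatenate([x[:k] - np.mean(x[:k]), x[k:] - np.mean(x[k:])])
            best_step = min(best_step, gaussian_aic(resid, n_params=3))  # two means + change trial
        return flat - best_step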
Fig. 5. Dopamine responses in a “trial-less” cue-reward task reflect causal structure like ANCCR, but unlike TDRL RPE.
A. A “trial-less” cue-reward learning task. Here, a cue (250 ms duration) is consistently followed by a reward at a fixed delay (3 s trace interval), but the cues themselves occur with an exponential inter-cue interval with a 33 s mean. B. Confirmation of these intuitions based on simulations (Methods) (one-sample t test against a null of zero; t(99) = RPE (CSC), −114.74; RPE (MS), −181.32; ANCCR, 322.53; two-tailed p values = RPE (CSC), 4.1×10⁻¹⁰⁷; RPE (MS), 1.1×10⁻¹²⁶; ANCCR, 2.1×10⁻¹⁵¹; n=100 iterations). Reward responses are predicted to be positive by both models (one-sample t test against a null of one; t(99) = RPE (CSC), 87.67; RPE (MS), 62.86; ANCCR, 16.78; two-tailed p values = RPE (CSC), 1.2×10⁻⁹⁵; RPE (MS), 1.3×10⁻⁸¹; ANCCR, 1.1×10⁻³⁰; n=100 iterations). C. Example traces from one animal showing that the dopamine response to the intermediate cue is positive. D. Quantification of the experimentally observed ratio between the intermediate cue response and the previous cue response (one-sample t test against a null of zero; t(6) = 6.64, two-tailed p = 5.6×10⁻⁴; n=7 animals), and the reward response (one-sample t test against a null of one; t(6) = 2.95; two-tailed p = 0.0256; n=7 animals).
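For concreteness, the “trial-less” schedule described in A could be generated as in the short sketch below (function and parameter names are illustrative assumptions, not the authors' task code).

    import numpy as np

    def make_trialless_schedule(n_cues=200, mean_ici=33.0, cue_dur=0.25, trace=3.0, seed=0):
        """Cue onsets with exponentially distributed inter-cue intervals (mean ~33 s);
        each 250 ms cue is followed by reward after a fixed 3 s trace interval."""
        rng = np.random.default_rng(seed)
        cue_onsets = np.cumsum(rng.exponential(mean_ici, size=n_cues))
        reward_times = cue_onsets + cue_dur + trace  # reward follows cue offset by the trace interval
        return cue_onsets, reward_times

    cue_onsets, reward_times = make_trialless_schedule()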
Fig. 6. No backpropagation of dopamine signals during learning.
A. Schematic of the learning dynamics of pre-reward dopamine based on RPE or ANCCR signaling. Schematic inspired by (50). If there is a temporal shift, the difference in dopamine response between the early and late phases of a trial will be negative in the initial trials. B. Dynamics of the dopamine response during early and late periods within a trial over training (left), and their difference during the first 100 trials. C. Simulated dynamics of dopamine responses to cues (CS1 and CS2) during sequential conditioning (left), and the averaged CS2 response during the last 50 trials (right). D. Experimental data showing the dynamics of dopamine responses to the cues (left), the response difference between the two cues during the early phase of learning (middle; similar to Fig. 6B, right), and the CS2 response during the late phase of learning (right; similar to Fig. 6C, right). E. Schematic of the optogenetic inhibition experiment during sequential conditioning for both experimental DAT-Cre animals receiving inhibition and control wild-type animals receiving light but no inhibition. Animals received laser from CS2 until reward throughout conditioning. F. Measured licking and dopamine responses on the first session of conditioning from an example experimental animal, showing robust inhibition. G. Quantification of the magnitude of inhibition during CS2 presentation prior to reward, and of the reward response. Both responses are measured relative to the pre-CS1 baseline. H. Predicted dopamine responses using simulations of RPE or ANCCR. I. Experimental data showing the CS1 response (left) and anticipatory licking (right) across sessions. Here, n represents the last session.


References

    1. Rescorla RA, Wagner AR. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory. 2, 64–99 (1972).
    2. Niv Y, Schoenbaum G. Dialogues on prediction errors. Trends Cogn Sci. 12, 265–272 (2008).
    3. Niv Y. Reinforcement learning in the brain. Journal of Mathematical Psychology. 53, 139–154 (2009).
    4. Cohen JY, Haesler S, Vong L, Lowell BB, Uchida N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature. 482, 85–88 (2012).
    5. Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 275, 1593–1599 (1997).