Learning to express reward prediction error-like dopaminergic activity requires plastic representations of time

Ian Cone et al.

Nat Commun. 2024 Jul 12;15(1):5856. doi: 10.1038/s41467-024-50205-3

Abstract

The dominant theoretical framework to account for reinforcement learning in the brain is temporal difference (TD) learning, whereby certain units signal reward prediction errors (RPE). The TD algorithm has traditionally been mapped onto the dopaminergic system, as the firing properties of dopamine neurons can resemble RPEs. However, certain predictions of TD learning are inconsistent with experimental results, and previous implementations of the algorithm have made unscalable assumptions regarding stimulus-specific fixed temporal bases. We propose an alternate framework to describe dopamine signaling in the brain, FLEX (Flexibly Learned Errors in Expected Reward). In FLEX, dopamine release is similar, but not identical, to RPE, leading to predictions that contrast with those of TD. While FLEX itself is a general theoretical framework, we describe a specific, biophysically plausible implementation, the results of which are consistent with a preponderance of both existing and reanalyzed experimental data.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Structure and assumptions of temporal bases for temporal difference learning.
a Diagram of a simple trace conditioning task. A conditioned stimulus (CS) such as a visual grating is paired, after a delay ΔT, with an unconditioned stimulus (US) such as a water reward. b According to the canonical view, dopaminergic (DA) neurons in the ventral tegmental area (VTA) respond only to the US before training, and only to the CS after training. c In order to represent the delay period, temporal difference (TD) models generally assume neural “microstates” which span the time between cue and reward. In the simplest case of the complete serial compound (left), the microstimuli do not overlap, and each one uniquely represents a different interval. In general, though (e.g., microstimuli, right), these microstates can overlap with each other and decay over time. d A weighted sum of these microstates determines the learned value function V(t). e An agent does not know a priori which cue will subsequently be paired with reward. As a result, microstate TD models implicitly assume that all N unique cues or experiences in an environment each have their own independent chain of microstates before learning. f Rewards delivered after the end of a particular cue-specific chain cannot be paired with the cue in question. The chosen length of the chain therefore determines the temporal window of possible associations. g Microstate chains are assumed to be reliable and robust, but realistic levels of neural noise, drift, and variability can interrupt their propagation, thereby disrupting their ability to associate cue and reward.
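To make the microstate formalism in panels c and d concrete, the following minimal Python sketch (not taken from the paper; the parameter values and variable names are illustrative assumptions) builds a complete serial compound basis, one one-hot microstate per time step after the cue, and learns the value function V(t) as a weighted sum of microstates using the standard TD(0) update:

    import numpy as np

    # Illustrative sketch of a complete serial compound (CSC) basis (Fig. 1c, left):
    # one unique, non-overlapping microstate per time step after the cue.
    n_steps = 20                      # length of the cue-specific microstate chain
    cue_time, reward_time = 2, 12     # cue and reward time steps within a trial
    gamma, alpha = 1.0, 0.1           # discount factor and learning rate (assumed)

    # x[t] is the microstate vector at time t: a one-hot code of time since the cue.
    x = np.zeros((n_steps, n_steps))
    for t in range(cue_time, n_steps):
        x[t, t - cue_time] = 1.0

    r = np.zeros(n_steps)
    r[reward_time] = 1.0              # reward delivered at a fixed delay after the cue

    w = np.zeros(n_steps)             # weights over microstates; V(t) = w . x(t) (Fig. 1d)
    for trial in range(200):
        for t in range(n_steps - 1):
            delta = r[t + 1] + gamma * (w @ x[t + 1]) - (w @ x[t])  # TD error (RPE-like)
            w += alpha * delta * x[t]                               # TD(0) weight update

    print(np.round(x @ w, 2))         # learned V(t) spans the cue-reward interval

Overlapping, decaying microstimuli (Fig. 1c, right) would replace the one-hot rows of x with broader, decaying bumps; the weighted-sum readout of V(t) is unchanged.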
Fig. 2
Fig. 2. A fixed recurrent neural network as a basis function generator.
a Schematic of a fixed recurrent neural network (RNN) as a temporal basis. The network receives external inputs and generates states s(t), which act as a basis for learning the value function V(t). Compare to Fig. 1c. b Schematic of the task protocol. Every presentation of C is followed by a reward at a fixed delay of 1000 ms. However, any combination or sequence of irrelevant stimuli may precede the conditioned stimulus C (they might also come after the CS, e.g. A, C, B). c Network activity, plotted along its first two principal components, for a given initial state s0 and a sequence of presented stimuli A-B-C (red letter is displayed at the time of a given stimulus’ presentation). d Same as c but for input sequence B-A-C. e Overlay of the A-B-C and B-A-C network trajectories, starting from the state at the time of the presentation of C (state sc). The trajectory of network activity differs in these two cases, so the RNN state does not provide a consistent temporal basis that tracks the time since the presentation of stimulus C.
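The failure mode in panels c-e can be illustrated with a generic fixed tanh rate network with random weights; this sketch is an assumption-laden toy, not the paper's simulation code. The network state at and after the presentation of C depends on which irrelevant stimuli preceded it, so a fixed RNN does not supply a consistent basis for time-since-C:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50                                               # network size (illustrative)
    W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))  # fixed recurrent weights
    inputs = {s: rng.normal(size=n) for s in "ABC"}      # fixed input vector per stimulus

    def run(sequence, times, T=60):
        """Simulate a fixed tanh RNN driven by pulsed stimuli; return states s(t)."""
        s = np.zeros(n)
        traj = []
        for t in range(T):
            ext = np.zeros(n)
            for stim, t0 in zip(sequence, times):
                if t == t0:
                    ext += inputs[stim]                  # stimulus pulse at its onset time
            s = np.tanh(W @ s + ext)
            traj.append(s.copy())
        return np.array(traj)

    t_c = 30                                             # time of stimulus C in both sequences
    traj_abc = run("ABC", (5, 15, t_c))
    traj_bac = run("BAC", (5, 15, t_c))

    # The trajectories from the time of C onward differ between the two stimulus
    # histories, so "time since C" is not consistently encoded (Fig. 2e).
    print(np.linalg.norm(traj_abc[t_c:] - traj_bac[t_c:], axis=1).round(2))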
Fig. 3
Fig. 3. Certain features of experimental results run counter to predictions of TD.
a Putative temporal basis functions (in red) observed in experiments develop over training, shown here schematically. If, after training on a given interval between the conditioned stimulus (CS) and the unconditioned stimulus (US), the interval is scaled, the basis functions also change. Recordings in striatum show these basis functions scale with the modified interval (top), while in recordings from hippocampus (bottom), they are observed to redistribute to fill the new interval. b According to TD(0) (temporal difference learning with no traces), RPE neuron activity (blue) during learning exhibits a backward-moving bump, from the time of the US to the time of the CS (left). For TD(λ) (TD with an eligibility trace decaying with parameter λ), the bump no longer appears (right). c A schematic depiction of experiments in which no backward-shifting bump is observed. d The integral of dopamine neuron (DA) activity according to TD theory (left) should be constant over training (for γ = 1, dotted line) or decrease monotonically (for γ < 1, blue line). Right, reanalyzed existing experimental data from a trace conditioning task in Coddington and Dudman (2018). The horizontal axis is the training trial, and the vertical axis is the mean modulation of DA neuron activity integrated over both the cue and reward periods (relative to baseline). Each blue dot represents a recording period for an individual neuron from either the ventral tegmental area (VTA) or substantia nigra pars compacta (SNc) (n = 96). The black line is a running average over 10 trials. A bracket with a star indicates blocks of 10 individual cell recording periods (dots) which show a significantly different modulated DA response (integrated over both the cue and reward periods) than that of the first 10 recording periods/cells (significance with a two-sided Wilcoxon rank sum test, p < 0.05). See also Supplementary Fig. 1. Definitions: temporal difference learning (TD), dopamine neurons (DA).
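The γ dependence sketched in panel d (left) can be reproduced with a toy TD(0) simulation on a cue-triggered microstate chain. This is an illustrative sketch, not the paper's analysis code, and it assumes the cue arrives unpredictably, so the value just before the cue is fixed at zero:

    import numpy as np

    def summed_rpe_per_trial(gamma, n_trials=200, chain_len=20, delay=10, alpha=0.2):
        """Summed TD(0) RPE per trial on a cue-triggered microstate chain (toy values)."""
        V = np.zeros(chain_len)            # value of each microstate (time since cue)
        r = np.zeros(chain_len)
        r[delay] = 1.0                     # reward `delay` steps after cue onset
        totals = []
        for _ in range(n_trials):
            deltas = [gamma * V[0] - 0.0]  # RPE at the (unpredicted) cue onset
            for tau in range(chain_len - 1):
                d = r[tau + 1] + gamma * V[tau + 1] - V[tau]
                V[tau] += alpha * d
                deltas.append(d)
            totals.append(sum(deltas))
        return np.array(totals)

    # gamma = 1: the summed RPE stays ~1 across training; gamma < 1: it decreases
    # over training toward roughly gamma**delay, as sketched in Fig. 3d (left).
    for g in (1.0, 0.99, 0.95):
        curve = summed_rpe_per_trial(g)
        print(f"gamma={g}: first trial {curve[0]:.3f}, last trial {curve[-1]:.3f}")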
Fig. 4
Fig. 4. Potential architectures for a flexible temporal basis.
Three example networks that could implement a FLEX theory. Top, network schematic. Middle, the activity of one example neuron in the network. Bottom, network activity before and after training. Each network initially has only transient responses to stimuli, and modifies plastic connections (in blue) during association to develop a temporal basis specific to reward-predictive stimuli. a Feedforward neural sequences or “chains” could support a FLEX model, if the chain could recruit more members during learning, exclusively for reward-predictive stimuli. b A population of neurons with homogeneous recurrent connections has a characteristic decay time that is related to the strength of the weights. The cue-relative time can then be read out from the mean level of activity in the network. c A population of neurons with heterogeneous and large recurrent connections (a liquid state machine) can represent cue-relative time by treating the activity vector at time t as the “microstate” representing time t (as opposed to the homogeneous case, where only the mean activity is used).
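As a rough illustration of panel b only (a hypothetical linear rate model, not the paper's implementation), the effective decay time of a homogeneous recurrent population grows as the recurrent weight approaches one, so a plastic recurrent weight could stretch the population's decay to span a learned cue-reward interval:

    import numpy as np

    def mean_activity_decay(w_rec, T=100, tau=10.0, dt=1.0):
        """Mean rate of a homogeneous recurrent population after a cue pulse.

        Assumed linear dynamics: tau * dr/dt = -r + w_rec * r, giving an effective
        decay time constant of tau / (1 - w_rec).
        """
        r = 1.0                          # population rate immediately after the cue
        rates = []
        for _ in range(int(T / dt)):
            r += dt / tau * (-r + w_rec * r)
            rates.append(r)
        return np.array(rates)

    for w in (0.5, 0.8, 0.9):            # stronger recurrence -> slower decay
        rates = mean_activity_decay(w)
        half_time = np.argmax(rates < 0.5)   # crude "timer" readout from mean activity
        print(f"w_rec={w}: effective tau ~ {10.0 / (1 - w):.0f} steps, half-decay at step {half_time}")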
Fig. 5
Fig. 5. Biophysically Inspired Architecture Allows for Flexible Encoding of Time.
a Diagram of the model architecture. Core neural architectures (CNAs, visualized here as columns) located in the PFC are selective to certain sensory stimuli (indicated here by the color atop each column) via fixed excitatory inputs (conditioned stimulus, CS). Ventral tegmental area dopamine (VTA, DA) neurons receive fixed input from naturally positive-valence stimuli, such as food or water reward (unconditioned stimulus, US). DA neuron firing releases dopamine, which acts as a learning signal for both PFC and VTA. Solid lines indicate fixed connections, while dotted lines indicate learned connections. b, c Schematic representation of data adapted from Liu et al. b Timers learn to characteristically decay at the time of cue-predicted reward. c Messengers learn to have a firing peak at the time of cue-predicted reward. Definitions: prefrontal cortex (PFC), ventral tegmental area (VTA).
Fig. 6
Fig. 6. CS-evoked and US-evoked model dopamine responses evolve on different timescales.
The model is trained for 30 trials while being presented with a conditioned stimulus (CS) at 100 ms and a reward at 1100 ms. a Mean firing rates for the core neural architecture (CNA) (see inset for colors; T = Timers, M = Messengers, Inh = Inhibitory), for three different stages of learning. b Mean firing rate over all DA neurons taken at the same three stages of learning. Firing above or below thresholds (dotted lines) evokes positive or negative D(t) in PFC. c Evolution of mean synaptic weights over the course of learning. Top, middle, and bottom, mean strength of Timer→Timer, CS → DA (conditioned stimulus → dopamine), and Messenger→GABA synapses, respectively. d Area under receiver operating characteristic (auROC, see Methods) for all VTA neurons in our model for 15 trials before (left, unconditioned stimulus, US, only), 15 trials during (middle, CS + US, conditioned + unconditioned stimulus), and 15 trials after (right, CS + US) learning. Values above and below 0.5 indicate firing rates above and below the baseline distribution. Definitions: dopamine (DA), prefrontal cortex (PFC), inhibitory neurons (GABA).
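For reference, the auROC in panel d can be computed as the probability that a trial-period firing rate exceeds a rate drawn from the baseline distribution (ties counted as half). The sketch below is a generic implementation with simulated spike counts, not the paper's Methods code:

    import numpy as np

    def auroc(rates, baseline_rates):
        """auROC of one unit's trial firing rates against its baseline distribution.

        Values > 0.5 indicate firing above baseline and < 0.5 below, matching the
        convention described for Fig. 6d.
        """
        rates = np.asarray(rates, dtype=float)[:, None]
        baseline = np.asarray(baseline_rates, dtype=float)[None, :]
        greater = (rates > baseline).mean()
        ties = (rates == baseline).mean()
        return greater + 0.5 * ties

    rng = np.random.default_rng(1)
    baseline = rng.poisson(5, size=200)       # simulated baseline-period spike counts
    elevated = rng.poisson(9, size=50)        # elevated response -> auROC above 0.5
    suppressed = rng.poisson(2, size=50)      # suppressed response -> auROC below 0.5
    print(auroc(elevated, baseline), auroc(suppressed, baseline))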
Fig. 7
Fig. 7. FLEX Model Dynamics Diverge from those of TD Learning.
Dynamics of both temporal difference (TD) learning with λ = 0.975 (TD(λ)) and FLEX when trained with the same conditioning protocol as shown in Fig. 6. a Dopaminergic release D(t) for FLEX (left), and RPE for TD(λ) (right), over the course of training. b Total/integrated reinforcement D(t) during a given trial of training in FLEX (black), and the sum total RPE in three instances of TD(λ) with discounting parameters γ = 1, γ = 0.99, and γ = 0.95 (red, orange, and yellow, respectively). Shaded areas indicate functionally different stages of learning in FLEX. The learning rate in our model is reduced in this example to allow a direct comparison with TD(λ). 100 trials of US-only presentation (before learning) are included for comparison with subsequent stages.
Fig. 8
Fig. 8. FLEX model reconciles differing experimental phenomena observed during sequential conditioning.
Results from “sequential conditioning”, where sequential neutral stimuli CS1 and CS2 (conditioned stimuli 1 and 2) are paired with delayed reward (US). a Visualization of the protocol. In this example, the US is presented starting at 1500 ms, with CS1 presented starting at 100 ms and CS2 presented starting at 800 ms. b Mean firing rates over all dopamine (DA) neurons, for four distinctive stages in learning: initialization (i), acquisition (ii), reward depression (iii), and serial transfer of activation (iv). c, d Schematic illustrations of experimental results from recorded dopamine neurons, labeled with the matching stage of learning in our model. c DA neuron firing before (top), during (middle), and after (bottom) training, wherein two cues (0 s and 4 s) were followed by a single reward (6 s). Adapted from Pan et al. d DA neuron firing after training, wherein two cues (instruction, trigger) were followed by a single reward. Adapted from Schultz et al.

References

    1. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction, Second Edition (MIT Press, 2018).
    2. Glickman, S. E. & Schiff, B. B. A biological theory of reinforcement. Psychol. Rev. 74, 81–109 (1967). doi: 10.1037/h0024290.
    3. Lee, D., Seo, H. & Jung, M. W. Neural basis of reinforcement learning and decision making. Annu. Rev. Neurosci. 35, 287–308 (2012). doi: 10.1146/annurev-neuro-062111-150512.
    4. Chersi, F. & Burgess, N. The cognitive architecture of spatial navigation: hippocampal and striatal contributions. Neuron 88, 64–77 (2015). doi: 10.1016/j.neuron.2015.09.021.
    5. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997). doi: 10.1126/science.275.5306.1593.
