Front Synaptic Neurosci. 2016 Dec 15;8:37. doi: 10.3389/fnsyn.2016.00037. eCollection 2016.

The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces



Marco A Huertas et al.

Abstract

The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different from those encountered by artificial systems. Two major problems encountered by RL are how to relate a stimulus to a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for expressing the LTP and LTD traces? Here we extend our previous model to include several neuromodulators, illustrate through various examples how these different neuromodulators contribute to learning reward timing within a wide set of training paradigms, and propose further roles that multiple neuromodulators can play in encoding additional information about the rewarding signal.
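To make the competition between traces concrete, a minimal sketch of the two-trace update follows. The trace shapes, decay constants, and amplitudes are illustrative assumptions, not the paper's actual equations.

import numpy as np

def trace(delay, tau, amp):
    """Eligibility trace amplitude a given delay after activity offset,
    assuming saturation to amp during activity and exponential decay
    afterwards (cf. Figure 1A). All values here are hypothetical."""
    return amp * np.exp(-delay / tau)

def two_trace_update(w, R, delay, lr=0.1):
    """Reward-modulated two-trace rule: the neuromodulatory signal R reads
    out the difference between the LTP and LTD traces at the time of
    reward. When the traces are equal the update vanishes, implementing
    a stopping rule that needs no inhibition of the reward nucleus."""
    T_ltp = trace(delay, tau=1.0, amp=0.95)   # slower decay (assumed)
    T_ltd = trace(delay, tau=0.4, amp=1.0)    # faster decay (assumed)
    return w + lr * R * (T_ltp - T_ltd)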

Keywords: LTD; LTP; eligibility-trace; neuromodulator; reinforcement-learning; reward; synaptic plasticity; timing.


Figures

Figure 1
Examples of trace activity for different time-dependent Hebbian functions. (A) Temporal profile of synaptic eligibility traces for a square Hebbian stimulus, illustrated by the dashed red line, with onset at time t = 0 and offset at time t = tstim. Traces rise with a time constant τ̃a (Equation 5) to an upper steady state T̃a (Equation 4) and decay to zero with a slower time constant τa. (B) A smooth, time-varying Hebbian function (dashed red line) representing the contributions of pre- and postsynaptic cell activity, and the resulting synaptic eligibility trace obtained from Equation (7) (blue). After the Hebbian term has decayed to zero, the dynamics of the trace follow an exponential decay (green).
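The behavior described in this caption is consistent with first-order saturating dynamics. A hedged reconstruction follows; the actual Equations (4), (5), and (7) are in the full text, so the functional form and the gain αa below are assumptions:

% A plausible first-order form for a trace T_a driven by a Hebbian
% signal H(t); the gain alpha_a is an assumed constant.
\frac{dT_a}{dt} = -\frac{T_a}{\tau_a} + \alpha_a H(t)\,(1 - T_a)

% For a square stimulus H(t) = H_0 this relaxes with the effective
% rise time toward the upper steady state:
\tilde{\tau}_a = \frac{\tau_a}{1 + \alpha_a H_0 \tau_a}, \qquad
\tilde{T}_a = \frac{\alpha_a H_0 \tau_a}{1 + \alpha_a H_0 \tau_a}

% After stimulus offset (H = 0) the trace decays exponentially with tau_a.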
Figure 2
The CRL in a recurrent neural network. (A) With a two-trace rule, an integrate-and-fire network with excitatory recurrent connections can learn to generate different interval times. (B) In a two-population network with fixed inhibitory connections, plastic excitatory connections, and external noise, a large range of interval times can be learned as well. (C) Illustration of the learning process in a recurrent network. Initially (Trial 1, left), the network activity (top panel) quickly decays, and so do the LTP-associated (green) and LTD-associated (red) eligibility traces (bottom panel). At the time of reward (vertical dashed line) the LTP trace dominates and recurrent connections are strengthened, resulting in longer-lasting network activity. After 5 trials (middle), the network activity lasts longer and the traces reach saturation (dashed lines). At the time of reward the LTP trace still dominates, resulting in an extension of the network time constant. After 10 trials (right), the network activity extends almost to the time of reward and the traces are equal at the time of reward. Therefore, there is no change in the strength of the lateral connections and the synaptic plasticity reaches a steady state.
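The trial-by-trial convergence in (C) can be caricatured by a scalar model in which the duration of network activity is the learned quantity. The trace parameters below are assumptions chosen only to reproduce the qualitative picture: LTP dominance while activity falls short of the reward, and balance at convergence.

import numpy as np

T_REWARD = 1.0                  # reward time (s), assumed
duration = 0.1                  # initial duration of network activity (s)
lr = 0.3                        # learning rate, assumed

for trial in range(40):
    delay = max(T_REWARD - duration, 0.0)    # gap between offset and reward
    T_ltp = 0.95 * np.exp(-delay / 1.0)      # LTP trace at reward time
    T_ltd = 1.00 * np.exp(-delay / 0.4)      # LTD trace at reward time
    duration += lr * (T_ltp - T_ltd)         # potentiation extends activity
    # The update vanishes as the traces equalize: plasticity reaches a
    # steady state with activity extending almost to the time of reward.

print(f"activity ends {T_REWARD - duration:.3f} s before the reward")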
Figure 3
Ramp reward paradigm and neuromodulator release profiles. (A) Single-neuromodulator case. The reward magnitude as a function of response time is shown by the red line in the upper panel. The longer the subject waits, the bigger the reward, until a time tmax after which no reward is received. It is assumed that the magnitude of the single neuromodulator follows this profile as well. The black curve in the upper panel schematically shows the distribution of response times after learning is complete. The lower panel shows a situation in which the maximum reward time (tmax) is decreased (blue) while the distribution of response times has not yet adapted to the new reward paradigm. Because the actions occur in the unrewarded region, the network cannot learn this new condition. (B) Two-neuromodulator case. The reward paradigm is the same as in (A), but the two neuromodulators respond differently. The two neuromodulator profiles are depicted by different colors as defined in the legend (cyan for the LTP-related and magenta for the LTD-related traces). For the long reward time (upper panel), both neuromodulators have the same profile if there is a reward. However, if there is an action but no reward, only the LTD-related neuromodulator is released. This becomes more apparent when the reward time is shifted to a shorter duration (lower panel). Here, most responses are initially not rewarded and only the LTD-related neuromodulator is released, triggered by the action.
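A sketch of the release rule this caption describes; the function names and the parameter kappa (which reappears in Figure 4D) are stand-ins, as the actual release model is in the full text:

def ramp_reward(t_action, t_max, slope=1.0):
    """Reward grows linearly with waiting time up to t_max; actions
    after t_max go unrewarded (red/blue lines, upper/lower panels)."""
    return slope * t_action if t_action <= t_max else 0.0

def release(t_action, t_max, kappa=0.2):
    """Returns the LTP- and LTD-related neuromodulator amounts (R_p, R_d).
    On rewarded actions both follow the reward profile; on unrewarded
    actions only the LTD-related signal is released, triggered by the
    action itself, with an assumed fixed amount kappa."""
    r = ramp_reward(t_action, t_max)
    if r > 0.0:
        return r, r
    return 0.0, kappa

This asymmetry is what lets the network unlearn responses that fall in the unrewarded region when tmax is shortened, which the single-neuromodulator scheme in (A) cannot do.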
Figure 4
Training with a ramp reward paradigm. (A) In the ramp reward paradigm, the animal gets a reward that depends on the timing of its action. The magnitude of the reward increases linearly with the timing of the action, until a maximal time tmax after which no reward is obtained (red line). Rodents trained with this paradigm (Namboodiri et al., 2015) act close to the time tmax, in a way that is nearly optimal given the temporal distribution of action times. A schematic depiction of the temporal distribution of actions is shown by the blue curve. As observed experimentally, the peak of the response times is located slightly below tmax. (B) Excitatory cells in a network trained with a ramp that terminates at tmax = 1500 ms exhibit sustained activity that terminates close to 1500 ms. A typical response is shown in red. When the paradigm is altered such that tmax = 1000 ms, the neural response adapts and terminates close to 1000 ms (typical response in blue). (C) The neural response varies from trial to trial. A box plot (median and quartiles) summarizes the statistics of threshold crossing for networks trained to 1500 ms (1) and 1000 ms (2). (D) The distribution of response times depends on the value of the parameter κ that determines the amount of Rd released when no reward is delivered.
Figure 5
Learning to respond to a rewarded pattern. (A) A set of input patterns is presented to the input layer; only one (red arrow) is rewarded. The other patterns are not rewarded. In this example the number of patterns is P = 8 and the pattern dimension is 31 × 31. Here a discrete reward is delivered 250 ms after stimulus onset. (B) Before training, the weight vector is random (left) and the responses to all patterns are similar and weak (center). The two eligibility traces increase due to the stimulus, and at the time of reward (green dashed line) the LTP trace (blue solid line) is stronger than the LTD trace (blue dashed line), resulting in potentiation of the stimulated synapses. (C) After training, the weight vector has the same structure as the rewarded pattern (left), the response to the rewarded pattern is strong (center), and the responses to the other patterns are much weaker. At the time of reward (vertical green dashed line), the LTP and LTD traces are equal in magnitude, so that no further changes in synaptic efficacy occur on average.
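A compact caricature of this experiment with a single readout unit; the saturating trace shapes and all constants are assumptions, chosen so that the LTP trace dominates for weak responses and the traces equalize at a finite response level (panels B and C):

import numpy as np

rng = np.random.default_rng(0)
P, N = 8, 31 * 31                          # eight 31 x 31 input patterns
patterns = (rng.random((P, N)) > 0.5).astype(float)
rewarded = 0                               # only one pattern is rewarded
w = 0.01 * rng.random(N)                   # random initial weights (B, left)

def traces_at_reward(h):
    """Assumed saturating LTP/LTD eligibility traces evaluated at the
    time of reward, as functions of the Hebbian drive h (pre x post)."""
    return np.tanh(2.0 * h), 1.2 * np.tanh(h)

for trial in range(200):
    x = patterns[rewarded]                 # reward follows only this pattern;
    post = (w @ x) / N                     # unrewarded patterns trigger no
    h = post * x                           # neuromodulator release in this
    T_ltp, T_ltd = traces_at_reward(h)     # sketch, hence no plasticity
    w += 0.5 * (T_ltp - T_ltd)             # update at the time of reward

# w now resembles the rewarded pattern (C, left); its response is strong,
# while responses to the other seven patterns stay substantially weaker.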
Figure 6
The response to the rewarded pattern can also learn to represent the magnitude of the reward. (A) Two examples of the postsynaptic neuron's firing rate in response to the rewarded pattern after training, and one example for an unrewarded pattern. In one example (black) Rp/Rd = 1.2 and in the other (red) Rp/Rd = 1.4. The unrewarded pattern (blue) evokes a much lower firing rate. This illustrates that a network trained with two different ratios of LTP- vs. LTD-associated reward (Rp/Rd) learns to represent the specific ratio: when Rp/Rd increases, so does the magnitude of the response to the rewarded pattern. Here Tmax^d/Tmax^p = 1.5. (B) The value of H at steady state increases monotonically as a function of Rp/Rd until a critical value is reached. Beyond this critical value the stable state is H = 0. The solid red line represents the analytical solution and the gray symbols represent the mean and standard deviation of simulations. The variability is over different synapses and time steps; all synapses associated with the rewarded pattern are taken into account. (C) H vs. Rp/Rd (as in B) but for different values of Tmax^d/Tmax^p.
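The fixed point behind panels B and C can be stated compactly. This is a reconstruction from the captions; how the traces depend on the Hebbian drive H, and hence the exact critical ratio, is in the full text:

% Balance condition at the time of reward t_r for a discrete reward:
% plasticity stops when the neuromodulator-weighted traces are equal.
R_p\,\bar{T}_p(H; t_r) \;=\; R_d\,\bar{T}_d(H; t_r)

Because the two traces saturate at different levels (their ratio set by Tmax^d/Tmax^p), this balance condition selects a steady-state H that varies with Rp/Rd, which is how the response magnitude comes to encode the reward ratio.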
Figure 7
Training with a step-function reward. (A) A pattern can be rewarded throughout the time it is presented (green shaded area), and not just at one time point as above. In such a case, at steady state the integral of the LTP eligibility trace times its associated reward magnitude (solid line) equals the integral of the LTD eligibility trace times its associated reward magnitude (dashed line). (B) H at steady state vs. Rp/Rd when Tmax^d/Tmax^p = 1.5. The solid red line represents the analytical solution and the gray bars represent the mean and standard deviation of simulations. The variability is over different synapses and time steps.
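In integrated form, writing the presentation window as [0, tstim] (an assumption about the integration limits), the steady-state condition of panel A reads:

R_p \int_0^{t_{\mathrm{stim}}} T_p(t)\,dt \;=\; R_d \int_0^{t_{\mathrm{stim}}} T_d(t)\,dt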


References

1. Beitel R. E., Schreiner C. E., Cheung S. W., Wang X., Merzenich M. M. (2003). Reward-dependent plasticity in the primary auditory cortex of adult monkeys trained to discriminate temporally modulated signals. Proc. Natl. Acad. Sci. U.S.A. 100, 11070–11075. doi: 10.1073/pnas.1334187100
2. Cassenaer S., Laurent G. (2012). Conditional modulation of spike-timing-dependent plasticity for olfactory learning. Nature 482, 47–52. doi: 10.1038/nature10776
3. Chubykin A. A., Roach E. B., Bear M. F., Shuler M. G. H. (2013). A cholinergic mechanism for reward timing within primary visual cortex. Neuron 77, 723–735. doi: 10.1016/j.neuron.2012.12.039
4. Gavornik J. P., Shouval H. Z. (2010). A network of spiking neurons that can represent interval timing: mean field analysis. J. Comput. Neurosci. 30, 501–513. doi: 10.1007/s10827-010-0275-y
5. Gavornik J. P., Shuler M. G. H., Loewenstein Y., Bear M. F., Shouval H. Z. (2009). Learning reward timing in cortex through reward dependent expression of synaptic plasticity. Proc. Natl. Acad. Sci. U.S.A. 106, 6826–6831. doi: 10.1073/pnas.0901835106
