Front Synaptic Neurosci. 2016 Dec 15;8:37. doi: 10.3389/fnsyn.2016.00037. eCollection 2016.

The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces



Marco A Huertas et al.

Abstract

The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different from those encountered by artificial systems. Two major problems encountered by RL are how to relate a stimulus to a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for expressing the LTP and LTD traces? Here we extend our previous model to include several neuromodulators, illustrate through various examples how these different neuromodulators contribute to learning reward timing within a wide set of training paradigms, and propose further roles that multiple neuromodulators can play in encoding additional information about the rewarding signal.
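To make the competition between traces concrete, a minimal sketch of the two-trace update follows. The trace shapes, decay constants, and amplitudes are illustrative assumptions, not the paper's actual equations.

import numpy as np

def trace(delay, tau, amp):
    """Eligibility trace amplitude a given delay after activity offset,
    assuming saturation to amp during activity and exponential decay
    afterwards (cf. Figure 1A). All values here are hypothetical."""
    return amp * np.exp(-delay / tau)

def two_trace_update(w, R, delay, lr=0.1):
    """Reward-modulated two-trace rule: the neuromodulatory signal R reads
    out the difference between the LTP and LTD traces at the time of
    reward. When the traces are equal the update vanishes, implementing
    a stopping rule that needs no inhibition of the reward nucleus."""
    T_ltp = trace(delay, tau=1.0, amp=0.95)   # slower decay (assumed)
    T_ltd = trace(delay, tau=0.4, amp=1.0)    # faster decay (assumed)
    return w + lr * R * (T_ltp - T_ltd)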

Keywords: LTD; LTP; eligibility-trace; neuromodulator; reinforcement-learning; reward; synaptic plasticity; timing.


Figures

Figure 1
Examples of trace activity for different time-dependent Hebbian functions. (A) Temporal profile of synaptic eligibility traces for a square Hebbian stimulus, illustrated by the dashed red line, with onset at time t = 0 and offset at time t = tstim. Traces rise with a time constant τ̃a (Equation 5) to an upper steady state T̃a (Equation 4) and decay to zero with a slower time constant τa. (B) A smooth, time-varying Hebbian function (dashed red line) representing the contributions of pre- and postsynaptic cell activity, and the resulting synaptic eligibility trace obtained from Equation (7) (blue). After the Hebbian term has decayed to zero, the dynamics of the trace follow an exponential decay (green).
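The behavior described in this caption is consistent with first-order saturating dynamics. A hedged reconstruction follows; the actual Equations (4), (5), and (7) are in the full text, so the functional form and the gain αa below are assumptions:

% A plausible first-order form for a trace T_a driven by a Hebbian
% signal H(t); the gain alpha_a is an assumed constant.
\frac{dT_a}{dt} = -\frac{T_a}{\tau_a} + \alpha_a H(t)\,(1 - T_a)

% For a square stimulus H(t) = H_0 this relaxes with the effective
% rise time toward the upper steady state:
\tilde{\tau}_a = \frac{\tau_a}{1 + \alpha_a H_0 \tau_a}, \qquad
\tilde{T}_a = \frac{\alpha_a H_0 \tau_a}{1 + \alpha_a H_0 \tau_a}

% After stimulus offset (H = 0) the trace decays exponentially with tau_a.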
Figure 2
The CRL in a recurrent neural network. (A) With a two-trace rule, an integrate-and-fire network with excitatory recurrent connections can learn to generate different interval times. (B) In a two-population network with fixed inhibitory connections, plastic excitatory connections, and external noise, a large range of interval times can be learned as well. (C) Illustration of the learning process in a recurrent network. Initially (Trial 1, left), the network activity (top panel) quickly decays, and so do the LTP-associated (green) and LTD-associated (red) eligibility traces (bottom panel). At the time of reward (vertical dashed line) the LTP trace dominates and recurrent connections are strengthened, resulting in longer-lasting network activity. After 5 trials (middle), the network activity lasts longer and the traces reach saturation (dashed lines). At the time of reward the LTP trace still dominates, resulting in an extension of the network time constant. After 10 trials (right), the network activity extends almost to the time of reward and the traces are equal at the time of reward. Therefore, there is no change in the strength of the lateral connections and the synaptic plasticity reaches a steady state.
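The trial-by-trial convergence in (C) can be caricatured by a scalar model in which the duration of network activity is the learned quantity. The trace parameters below are assumptions chosen only to reproduce the qualitative picture: LTP dominance while activity falls short of the reward, and balance at convergence.

import numpy as np

T_REWARD = 1.0                  # reward time (s), assumed
duration = 0.1                  # initial duration of network activity (s)
lr = 0.3                        # learning rate, assumed

for trial in range(40):
    delay = max(T_REWARD - duration, 0.0)    # gap between offset and reward
    T_ltp = 0.95 * np.exp(-delay / 1.0)      # LTP trace at reward time
    T_ltd = 1.00 * np.exp(-delay / 0.4)      # LTD trace at reward time
    duration += lr * (T_ltp - T_ltd)         # potentiation extends activity
    # The update vanishes as the traces equalize: plasticity reaches a
    # steady state with activity extending almost to the time of reward.

print(f"activity ends {T_REWARD - duration:.3f} s before the reward")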
Figure 3
Ramp reward paradigm and neuromodulator release profiles. (A) Single-neuromodulator case. The reward magnitude as a function of response time is shown by the red line in the upper panel. The longer the subject waits, the bigger the reward, until a time tmax after which no reward is received. It is assumed that the magnitude of the single neuromodulator follows this profile as well. The black curve in the upper panel schematically shows the distribution of response times after learning is complete. The lower panel shows a situation in which the maximum reward time (tmax) is decreased (blue) while the distribution of response times has not yet adapted to the new reward paradigm. Because the actions occur in the unrewarded region, the network cannot learn this new condition. (B) Two-neuromodulator case. The reward paradigm is the same as in (A), but the two neuromodulators respond differently. The two neuromodulator profiles are depicted by different colors as defined in the legend (cyan for the LTP-related and magenta for the LTD-related traces). For the long reward time (upper panel), both neuromodulators have the same profile if there is a reward. However, if there is an action but no reward, only the LTD-related neuromodulator is released. This becomes more apparent when the reward time is shifted to a shorter duration (lower panel). Here, most responses are initially not rewarded and only the LTD-related neuromodulator is released, triggered by the action.
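A sketch of the release rule this caption describes; the function names and the parameter kappa (which reappears in Figure 4D) are stand-ins, as the actual release model is in the full text:

def ramp_reward(t_action, t_max, slope=1.0):
    """Reward grows linearly with waiting time up to t_max; actions
    after t_max go unrewarded (red/blue lines, upper/lower panels)."""
    return slope * t_action if t_action <= t_max else 0.0

def release(t_action, t_max, kappa=0.2):
    """Returns the LTP- and LTD-related neuromodulator amounts (R_p, R_d).
    On rewarded actions both follow the reward profile; on unrewarded
    actions only the LTD-related signal is released, triggered by the
    action itself, with an assumed fixed amount kappa."""
    r = ramp_reward(t_action, t_max)
    if r > 0.0:
        return r, r
    return 0.0, kappa

This asymmetry is what lets the network unlearn responses that fall in the unrewarded region when tmax is shortened, which the single-neuromodulator scheme in (A) cannot do.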
Figure 4
Training with a ramp reward paradigm. (A) In the ramp reward paradigm, the animal gets a reward that depends on the timing of its action. The magnitude of the reward increases linearly with the timing of the action, until a maximal time tmax after which no reward is obtained (red line). Rodents trained with this paradigm (Namboodiri et al., 2015) act close to the time tmax, in a way that is nearly optimal given the temporal distribution of action times. A schematic depiction of the temporal distribution of actions is shown by the blue curve. As observed experimentally, the peak of the response times is located slightly below tmax. (B) Excitatory cells in a network trained with a ramp that terminates at tmax = 1500 ms exhibit sustained activity that terminates close to 1500 ms. A typical response is shown in red. When the paradigm is altered such that tmax = 1000 ms, the neural response adapts and terminates close to 1000 ms (typical response in blue). (C) The neural response varies from trial to trial. A box plot (median and quartiles) summarizes the statistics of threshold crossing for networks trained to 1500 ms (1) and 1000 ms (2). (D) The distribution of response times depends on the value of the parameter κ that determines the amount of Rd released when no reward is delivered.
Figure 5
Learning to respond to a rewarded pattern. (A) A set of input patterns is presented to the input layer; only one (red arrow) is rewarded. The other patterns are not rewarded. In this example the number of patterns is P = 8 and the pattern dimension is 31 × 31. Here a discrete reward is delivered 250 ms after stimulus onset. (B) Before training, the weight vector is random (left) and the responses to all patterns are similar and weak (center). The two eligibility traces increase due to the stimulus, and at the time of reward (green dashed line) the LTP trace (blue solid line) is stronger than the LTD trace (blue dashed line), resulting in potentiation of the stimulated synapses. (C) After training, the weight vector has the same structure as the rewarded pattern (left), the response to the rewarded pattern is strong (center), and the responses to the other patterns are much weaker. At the time of reward (vertical green dashed line), the LTP and LTD traces are equal in magnitude, so that no further changes in synaptic efficacy occur on average.
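A compact caricature of this experiment with a single readout unit; the saturating trace shapes and all constants are assumptions, chosen so that the LTP trace dominates for weak responses and the traces equalize at a finite response level (panels B and C):

import numpy as np

rng = np.random.default_rng(0)
P, N = 8, 31 * 31                          # eight 31 x 31 input patterns
patterns = (rng.random((P, N)) > 0.5).astype(float)
rewarded = 0                               # only one pattern is rewarded
w = 0.01 * rng.random(N)                   # random initial weights (B, left)

def traces_at_reward(h):
    """Assumed saturating LTP/LTD eligibility traces evaluated at the
    time of reward, as functions of the Hebbian drive h (pre x post)."""
    return np.tanh(2.0 * h), 1.2 * np.tanh(h)

for trial in range(200):
    x = patterns[rewarded]                 # reward follows only this pattern;
    post = (w @ x) / N                     # unrewarded patterns trigger no
    h = post * x                           # neuromodulator release in this
    T_ltp, T_ltd = traces_at_reward(h)     # sketch, hence no plasticity
    w += 0.5 * (T_ltp - T_ltd)             # update at the time of reward

# w now resembles the rewarded pattern (C, left); its response is strong,
# while responses to the other seven patterns stay substantially weaker.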
Figure 6
The response to the rewarded pattern can also learn to represent the magnitude of the reward. (A) Two examples of the postsynaptic neuron's firing rate in response to the rewarded pattern after training, and one example for an unrewarded pattern. In one example (black) Rp/Rd = 1.2 and in the other (red) Rp/Rd = 1.4. The unrewarded pattern (blue) evokes a much lower firing rate. This illustrates that a network trained with two different ratios of LTP- vs. LTD-associated reward (Rp/Rd) learns to represent the specific ratio: when Rp/Rd increases, so does the magnitude of the response to the rewarded pattern. Here Tmax^d/Tmax^p = 1.5. (B) The value of H at steady state increases monotonically as a function of Rp/Rd until a critical value is reached. Beyond this critical value the stable state is H = 0. The solid red line represents the analytical solution and the gray symbols represent the mean and standard deviation of simulations. The variability is over different synapses and time steps; all synapses associated with the rewarded pattern are taken into account. (C) H vs. Rp/Rd (as in B) but for different values of Tmax^d/Tmax^p.
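The fixed point behind panels B and C can be stated compactly. This is a reconstruction from the captions; how the traces depend on the Hebbian drive H, and hence the exact critical ratio, is in the full text:

% Balance condition at the time of reward t_r for a discrete reward:
% plasticity stops when the neuromodulator-weighted traces are equal.
R_p\,\bar{T}_p(H; t_r) \;=\; R_d\,\bar{T}_d(H; t_r)

Because the two traces saturate at different levels (their ratio set by Tmax^d/Tmax^p), this balance condition selects a steady-state H that varies with Rp/Rd, which is how the response magnitude comes to encode the reward ratio.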
Figure 7
Training with a step-function reward. (A) A pattern can be rewarded throughout the time it is presented (green shaded area), and not just at one time point as above. In such a case, at steady state the integral of the LTP eligibility trace times its associated reward magnitude (solid line) equals the integral of the LTD eligibility trace times its associated reward magnitude (dashed line). (B) H at steady state vs. Rp/Rd when Tmax^d/Tmax^p = 1.5. The solid red line represents the analytical solution and the gray bars represent the mean and standard deviation of simulations. The variability is over different synapses and time steps.
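In integrated form, writing the presentation window as [0, tstim] (an assumption about the integration limits), the steady-state condition of panel A reads:

R_p \int_0^{t_{\mathrm{stim}}} T_p(t)\,dt \;=\; R_d \int_0^{t_{\mathrm{stim}}} T_d(t)\,dt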


References

1. Beitel R. E., Schreiner C. E., Cheung S. W., Wang X., Merzenich M. M. (2003). Reward-dependent plasticity in the primary auditory cortex of adult monkeys trained to discriminate temporally modulated signals. Proc. Natl. Acad. Sci. U.S.A. 100, 11070–11075. doi: 10.1073/pnas.1334187100
2. Cassenaer S., Laurent G. (2012). Conditional modulation of spike-timing-dependent plasticity for olfactory learning. Nature 482, 47–52. doi: 10.1038/nature10776
3. Chubykin A. A., Roach E. B., Bear M. F., Shuler M. G. H. (2013). A cholinergic mechanism for reward timing within primary visual cortex. Neuron 77, 723–735. doi: 10.1016/j.neuron.2012.12.039
4. Gavornik J. P., Shouval H. Z. (2010). A network of spiking neurons that can represent interval timing: mean field analysis. J. Comput. Neurosci. 30, 501–513. doi: 10.1007/s10827-010-0275-y
5. Gavornik J. P., Shuler M. G. H., Loewenstein Y., Bear M. F., Shouval H. Z. (2009). Learning reward timing in cortex through reward dependent expression of synaptic plasticity. Proc. Natl. Acad. Sci. U.S.A. 106, 6826–6831. doi: 10.1073/pnas.0901835106
