Front Hum Neurosci. 2011 May 27;5:47. doi: 10.3389/fnhum.2011.00047. eCollection 2011.

Shifting responsibly: the importance of striatal modularity to reinforcement learning in uncertain environments

Ken-Ichi Amemori et al.
Abstract

We propose here that the modular organization of the striatum reflects a context-sensitive modular learning architecture in which clustered striosome-matrisome domains participate in modular reinforcement learning (RL). Based on anatomical and physiological evidence, it has been suggested that the modular organization of the striatum could represent a learning architecture. There is not, however, a coherent view of how such a learning architecture could relate to the organization of striatal outputs into the direct and indirect pathways of the basal ganglia, nor a clear formulation of how such a modular architecture relates to the RL functions attributed to the striatum. Here, we hypothesize that striosome-matrisome modules not only learn to bias behavior toward specific actions, as in standard RL, but also learn to assess their own relevance to the environmental context and modulate their own learning and activity on this basis. We further hypothesize that the contextual relevance or "responsibility" of modules is determined by errors in predictions of environmental features and that such responsibility is assigned by striosomes and conveyed to matrisomes via local circuit interneurons. To examine these hypotheses and to identify the general requirements for realizing this architecture in the nervous system, we developed a simple modular RL model. We then constructed a network model of basal ganglia circuitry that includes these modules and the direct and indirect pathways. Based on simple assumptions, this model suggests that while the direct pathway may promote actions based on striatal action values, the indirect pathway may act as a gating network that facilitates or suppresses behavioral modules on the basis of striatal responsibility signals. Our modeling functionally unites the modular compartmental organization of the striatum with the direct-indirect pathway divisions of the basal ganglia, a step that we suggest will have important clinical implications.

Keywords: acetylcholine; basal ganglia; direct and indirect pathways; mixture of experts; modular reinforcement learning; responsibility signal; striatum; striosome and matrix compartments.


Figures

Figure 1
Schematic diagram of modular reinforcement learning (RL) model. Each module m produces a responsibility λm and a policy πm. Responsibility λm is calculated based on the accumulated squared prediction error Γm, which in turn is based on a comparison of a prediction pm of a feature of the environment with the actual feature (in this case the reward Rt). The module with greater λm is selected based on the softmax selection rule ρ, which can be seen as a description of a gating network. The modular policy πm assigns each module's probability of choosing each candidate action based on the modular action-value function Qm. The policy πm of the selected module determines the actual action at. The learning or updating of pm and Qm is performed only within the selected module using the global reward signal Rt.
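The update scheme described in this caption can be sketched in code. The following is a minimal illustration, not the authors' implementation: the learning rates, the error-decay constant, and the softmax temperature are assumptions, since the caption specifies only the quantities λm, Γm, pm, πm, and Qm and the rule that learning occurs only in the selected module.

```python
import math
import random

class Module:
    """One expert module: a reward prediction p(s) and action values Q(s, a)."""
    def __init__(self, n_states, n_actions):
        self.p = [0.0] * n_states                     # prediction p_m of the environmental feature (here, reward)
        self.Q = [[0.0] * n_actions for _ in range(n_states)]  # action-value function Q_m
        self.gamma_err = 0.0                          # accumulated squared prediction error Γ_m

def responsibilities(modules, beta=5.0):
    """Softmax over negative accumulated error: lower Γ_m yields higher λ_m."""
    scores = [math.exp(-beta * m.gamma_err) for m in modules]
    z = sum(scores)
    return [s / z for s in scores]

def select_module(modules, beta=5.0):
    """Sample a module according to its responsibility (the gating rule ρ)."""
    lam = responsibilities(modules, beta)
    r, acc = random.random(), 0.0
    for i, w in enumerate(lam):
        acc += w
        if r < acc:
            return i
    return len(modules) - 1

def step(modules, state, reward, next_state, action,
         alpha=0.1, decay=0.9, gamma=0.95):
    """Update prediction and action values only within the selected module."""
    idx = select_module(modules)
    m = modules[idx]
    err = reward - m.p[state]                         # prediction error on the feature
    m.gamma_err = decay * m.gamma_err + err ** 2      # accumulate squared error Γ_m
    m.p[state] += alpha * err                         # update prediction p_m
    td = reward + gamma * max(m.Q[next_state]) - m.Q[state][action]
    m.Q[state][action] += alpha * td                  # update Q_m
    return idx
```

The module with the smaller accumulated error thus both dominates action selection and receives the learning updates, which is what drives each module to specialize on one context.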
Figure 2
Dynamically changing grid world explored by modular RL model. The agent (circle) can move either left or right, and when the agent receives reward, it is returned to the center (s = 7). The reward is placed at either s = 1 or s = 14, and this location alternates every 2500 time steps. When the agent reaches the unrewarded terminal, it is moved one position back (from s = 14 to s = 13 in Env. B and from s = 1 to s = 2 in Env. A). Each of the model's two modules becomes specialized by modular RL to maximize the agent's accumulated reward in one of the two versions of the environment (Env. A and Env. B) defined by reward location.
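The environment in this caption is simple enough to sketch directly. This is an illustrative reconstruction from the caption alone: the reward magnitude (1.0 per rewarded arrival) is an assumption, while the 14-state layout, the return-to-center rule, the one-step-back rule at the unrewarded terminal, and the 2500-step alternation schedule are as stated.

```python
class AlternatingGridWorld:
    """1-D grid world (states 1..14) whose rewarded end alternates every
    switch_every time steps between s = 1 (Env. A) and s = 14 (Env. B)."""

    def __init__(self, switch_every=2500):
        self.switch_every = switch_every
        self.t = 0
        self.state = 7              # agent starts at the center
        self.reward_at = 1          # begin in Env. A: reward at s = 1

    def step(self, action):
        """action: -1 (move left) or +1 (move right). Returns the reward."""
        self.t += 1
        if self.t % self.switch_every == 0:
            self.reward_at = 15 - self.reward_at   # toggle between 1 and 14
        self.state += action
        if self.state == self.reward_at:
            self.state = 7          # rewarded: return to the center
            return 1.0
        if self.state in (1, 14):   # unrewarded terminal: move one position back
            self.state = 2 if self.state == 1 else 13
        return 0.0
```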
Figure 3
Action value, prediction, and state value functions of the two modules as the model learns in two versions of the environment. The reward location alternates every 2500 time steps. Columns represent different locations. States (locations) 2 through 13 are shown; terminal locations are omitted. (A) Action value (Q) as a function of state and time. Top two panels represent the action-value function of module A and bottom two panels represent that of module B. Left panels show the values for leftward movements and right panels show the values for rightward movements. After training, module A selectively prefers rightward movements and module B selectively prefers leftward movements. (B) Prediction and state value functions of each module. Top two panels represent the prediction (P) of each module. Bottom two panels represent the state value function (V) of each module. Left panels show the functions for module A and right panels show the functions for module B. After training, module A predicts that the reward is likely to be obtained in the rightmost position and assigns the rightmost position the highest state value, while module B makes these assignments for the leftmost position.
Figure 4
Module responsibility, module selection, and preferred location follow changes in environment. (A) Responsibility signals of module B (red) and module A (blue) as functions of time. (B) Difference of responsibility signals, λB − λA, (green line) plotted with the changing environment (Env. A or B; red). Positive differences imply greater module B responsibility, whereas negative differences imply greater module A responsibility. (C) Selected module (blue) and environment (red) as functions of time. In environment A (Env. A), reward is located at s = 1. In environment B (Env. B), reward is located at s = 14. Modules switch rapidly at first and then follow changes in environment. (D) Location of the agent as a function of time late in training, from time 27350 to time 27650 (green line). Blue line indicates location smoothed with a moving average with window of width 100. Red circles indicate the times and locations at which the agent obtained the reward. Environment changes from Env. B to Env. A at time 27500 (dashed line). The module switches from B to A around 27530. (E) Location of the agent as a function of time for the entire training period. Symbols are as in (D). After learning, the agent can obtain rewards in either terminal, depending on the environment. (F) Failure of learning of normal, non-modular RL. In this case, the model learns to obtain rewards only at s = 14.
Figure 5
Schematic diagram of cortico-basal ganglia-thalamo-cortical network model. Red arrows and “(+)” indicate excitatory glutamatergic projections, blue arrows and “(−)” indicate inhibitory GABAergic projections, burgundy arrows indicate modulatory connections, i.e., responsibility signals and dopamine signals. Responsibility signals could potentially be conveyed from striosomes by cholinergic interneurons (ACh), and could modulate dopamine signals (DA) reaching D1 and D2 medium spiny neurons (MSNs) from the SNc/VTA (not explicitly included in our computational model). In addition to its input from the thalamus, the output region of the neocortex receives inputs from the input region of the neocortex and has self-feedback connections. “a” and “b” represent actions and action-related signals (e.g., action-value or action-selection signals), and “A” and “B” represent modules. “D1” and “D2” represent direct-pathway, D1-expressing matrix MSNs and their projections and indirect-pathway, D2-expressing matrix MSNs and their projections, respectively. In the model striatum, matrix MSNs can be in either module, express D1 or D2 dopamine receptors, and represent multiple action values. The evaluation cortex (also not explicitly included in our computational model) is assumed to send signals related to responsibility to striosomes. The responsibility signals then influence action-value representations of matrix MSNs. Both direct and indirect pathways in the model are topographical and convergent at a fine level corresponding to action-value representations. GPe, globus pallidus external segment; STN, subthalamic nucleus; GPi/SNr, globus pallidus internal segment/substantia nigra pars reticulata; SNc/VTA, substantia nigra pars compacta/ventral tegmental area; D1 and D2, D1 and D2 MSNs; ACh, acetylcholine; DA, dopamine.
Figure 6
Influences of input cortex and responsibility signaling on striatal matrix MSN activity in the network model. We model two neurons of the input region of the cortex, each representing information related to the value of action “a” or “b.” These neurons project, respectively, to a D1 and a D2 matrix MSN in each of two modules in the striatum. (A) Cortical and striatal activity (arbitrary units) for simulation using only positive responsibility signals. The effect of positive responsibility signals is labeled “↑ DA,” based on the possibility that they may involve an increase in phasic local striatal dopamine release (triggered by a decrease in local acetylcholine release). Responsibility is assigned to module B at time 50 and module A at time 200 (yellow boxes). Such responsibility signaling transiently increases D1 MSN activity and decreases D2 MSN activity. (B) Cortical and striatal activity for simulation using both positive and negative responsibility signals. The effect of negative responsibility signals is labeled “↑ ACh,” based on the possibility that they may involve an increase in local striatal acetylcholine release. Positive responsibility is again assigned to module B at time 50 and module A at time 200 (yellow boxes). Additionally, negative responsibility is assigned to module A at time 50 and module B at time 200 (gray boxes). Negative responsibility signaling transiently increases D2 MSN activity.
Figure 7
Neuronal activity in structures of the cortico-basal ganglia-thalamo-cortical network model. (A) Firing frequency of D1 MSNs in the striatum (n = 300). Color scale indicates firing frequency, x-axis indicates neuron index, and y-axis indicates time in arbitrary units. MSNs on the left (from x = 1 to 150) are in module A and MSNs on the right (from x = 151 to 300) are in module B. Neuron “a” in the input region of the cortex (Figure 6A, left) projects to MSNs 54 and 205, and neuron “b” projects to MSNs 99 and 249. (B) Firing frequency of D2 MSNs (n = 300), which receive exactly the same pattern of connections from the input cortex as do D1 MSNs. (C) Firing frequency of GPe neurons (n = 100). Adjacent GPe neurons receive overlapping convergent inhibitory input from adjacent striatal D2 MSNs. As a result of this overlapping convergent connectivity, the focal striatal activity causes less focal GPe inhibition (i.e., the inhibition is spread or “blurred” over adjacent GPe neurons; blue troughs). (D) Firing frequency of GPi/SNr neurons (n = 50), which receive convergent inhibitory input from striatal D1 MSNs (blue troughs) and excitatory input from STN (red peaks). (E) Firing frequency of STN neurons (n = 25), which receive convergent inhibitory input from GPe (red peaks represent lowest inhibition). (F) Firing frequency of thalamic neurons (n = 200) in the simulation using only positive responsibility signals. (G) Membrane potential of neurons in the output region of the cortex (n = 200) in the simulation using only positive responsibility signals. Vertical red bars represent persistent supra-threshold cortical depolarization maintained by self-feedback connections. (H) Firing frequency of thalamic neurons (n = 200) in the simulation using both positive and negative responsibility signals. (I) Membrane potential of neurons in the output region of the cortex (n = 200) in the simulation using both positive and negative responsibility signals. 
The blue troughs observed in the thalamic and cortical activity are deeper in the simulation using both positive and negative responsibility signals. Note: in (A,B), we show only about one out of every six of the inactive MSNs, to make the active MSNs more visible in the figure.
Figure 8
Schematic summary of proposed effects of striatal responsibility signals on module and action selection. Top: If the sets of actions influenced by module A are appropriate to the environmental context, its striosome (S) assigns high responsibility by sending a signal to adjacent matrisomes (M) via local circuit interneurons. This results in relatively low activity in the indirect pathway and high activity in the direct pathway, which permits the direct pathway to promote selection of an action. Bottom: If the sets of actions influenced by module B are inappropriate to the environmental context, its striosome assigns low responsibility. This results in relatively high activity in the indirect pathway, which suppresses the associated set of candidate actions (behavioral module).
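The gating logic summarized above can be sketched as a small function: the indirect pathway suppresses behavioral modules whose responsibility is low, and the direct pathway then promotes an action from the remaining modules' action values. The caption describes this only qualitatively, so the suppression threshold and the winner-take-all selection here are illustrative assumptions.

```python
def gate_and_select(action_values, lam, suppress_below=0.3):
    """Indirect-pathway-style gating sketch.

    action_values: dict mapping module name -> list of action values
                   for the current state.
    lam:           dict mapping module name -> responsibility λ_m.
    Returns (module, action_index) for the promoted action.
    """
    # Indirect pathway: suppress modules whose responsibility is low.
    active = {m: q for m, q in action_values.items() if lam[m] >= suppress_below}
    if not active:                          # fall back to the most responsible module
        best_m = max(lam, key=lam.get)
        active = {best_m: action_values[best_m]}
    # Direct pathway: promote the highest-valued action among active modules.
    best = None
    for m, q in active.items():
        for a, v in enumerate(q):
            if best is None or v > best[2]:
                best = (m, a, v)
    return best[0], best[1]
```

Note how a high-valued action in a suppressed module (as in module B above) never reaches selection, matching the figure's account of the indirect pathway vetoing a whole behavioral module rather than individual actions.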
