Front Hum Neurosci. 2011 May 27;5:47. doi: 10.3389/fnhum.2011.00047. eCollection 2011.

Shifting responsibly: the importance of striatal modularity to reinforcement learning in uncertain environments

Ken-Ichi Amemori et al.
Abstract

We propose here that the modular organization of the striatum reflects a context-sensitive modular learning architecture in which clustered striosome-matrisome domains participate in modular reinforcement learning (RL). Based on anatomical and physiological evidence, it has been suggested that the modular organization of the striatum could represent a learning architecture. There is not, however, a coherent view of how such a learning architecture could relate to the organization of striatal outputs into the direct and indirect pathways of the basal ganglia, nor a clear formulation of how such a modular architecture relates to the RL functions attributed to the striatum. Here, we hypothesize that striosome-matrisome modules not only learn to bias behavior toward specific actions, as in standard RL, but also learn to assess their own relevance to the environmental context and modulate their own learning and activity on this basis. We further hypothesize that the contextual relevance or "responsibility" of modules is determined by errors in predictions of environmental features and that such responsibility is assigned by striosomes and conveyed to matrisomes via local circuit interneurons. To examine these hypotheses and to identify the general requirements for realizing this architecture in the nervous system, we developed a simple modular RL model. We then constructed a network model of basal ganglia circuitry that includes these modules and the direct and indirect pathways. Based on simple assumptions, this model suggests that while the direct pathway may promote actions based on striatal action values, the indirect pathway may act as a gating network that facilitates or suppresses behavioral modules on the basis of striatal responsibility signals. Our modeling functionally unites the modular compartmental organization of the striatum with the direct-indirect pathway divisions of the basal ganglia, a step that we suggest will have important clinical implications.

Keywords: acetylcholine; basal ganglia; direct and indirect pathways; mixture of experts; modular reinforcement learning; responsibility signal; striatum; striosome and matrix compartments.


Figures

Figure 1
Schematic diagram of modular reinforcement learning (RL) model. Each module m produces a responsibility λm and a policy πm. Responsibility λm is calculated based on the accumulated squared prediction error Γm, which in turn is based on a comparison of a prediction pm of a feature of the environment with the actual feature (in this case the reward Rt). The module with greater λm is selected based on the softmax selection rule ρ, which can be seen as a description of a gating network. The modular policy πm assigns each module's probability of choosing each candidate action based on the modular action-value function Qm. The policy πm of the selected module determines the actual action at. The learning or updating of pm and Qm is performed only within the selected module using the global reward signal Rt.
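The update scheme described in this caption can be sketched in code. The following is a minimal illustration, not the authors' implementation: the learning rates, the error-decay constant, and the softmax temperature are assumptions, since the caption specifies only the quantities λm, Γm, pm, πm, and Qm and the rule that learning occurs only in the selected module.

```python
import math
import random

class Module:
    """One expert module: a reward prediction p(s) and action values Q(s, a)."""
    def __init__(self, n_states, n_actions):
        self.p = [0.0] * n_states                     # prediction p_m of the environmental feature (here, reward)
        self.Q = [[0.0] * n_actions for _ in range(n_states)]  # action-value function Q_m
        self.gamma_err = 0.0                          # accumulated squared prediction error Γ_m

def responsibilities(modules, beta=5.0):
    """Softmax over negative accumulated error: lower Γ_m yields higher λ_m."""
    scores = [math.exp(-beta * m.gamma_err) for m in modules]
    z = sum(scores)
    return [s / z for s in scores]

def select_module(modules, beta=5.0):
    """Sample a module according to its responsibility (the gating rule ρ)."""
    lam = responsibilities(modules, beta)
    r, acc = random.random(), 0.0
    for i, w in enumerate(lam):
        acc += w
        if r < acc:
            return i
    return len(modules) - 1

def step(modules, state, reward, next_state, action,
         alpha=0.1, decay=0.9, gamma=0.95):
    """Update prediction and action values only within the selected module."""
    idx = select_module(modules)
    m = modules[idx]
    err = reward - m.p[state]                         # prediction error on the feature
    m.gamma_err = decay * m.gamma_err + err ** 2      # accumulate squared error Γ_m
    m.p[state] += alpha * err                         # update prediction p_m
    td = reward + gamma * max(m.Q[next_state]) - m.Q[state][action]
    m.Q[state][action] += alpha * td                  # update Q_m
    return idx
```

The module with the smaller accumulated error thus both dominates action selection and receives the learning updates, which is what drives each module to specialize on one context.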
Figure 2
Dynamically changing grid world explored by modular RL model. The agent (circle) can move either left or right, and when the agent receives reward, it is returned to the center (s = 7). The reward is placed at either s = 1 or s = 14, and this location alternates every 2500 time steps. When the agent reaches the unrewarded terminal, it is moved one position back (from s = 14 to s = 13 in Env. B and from s = 1 to s = 2 in Env. A). Each of the model's two modules becomes specialized by modular RL to maximize the agent's accumulated reward in one of the two versions of the environment (Env. A and Env. B) defined by reward location.
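The environment in this caption is simple enough to sketch directly. This is an illustrative reconstruction from the caption alone: the reward magnitude (1.0 per rewarded arrival) is an assumption, while the 14-state layout, the return-to-center rule, the one-step-back rule at the unrewarded terminal, and the 2500-step alternation schedule are as stated.

```python
class AlternatingGridWorld:
    """1-D grid world (states 1..14) whose rewarded end alternates every
    switch_every time steps between s = 1 (Env. A) and s = 14 (Env. B)."""

    def __init__(self, switch_every=2500):
        self.switch_every = switch_every
        self.t = 0
        self.state = 7              # agent starts at the center
        self.reward_at = 1          # begin in Env. A: reward at s = 1

    def step(self, action):
        """action: -1 (move left) or +1 (move right). Returns the reward."""
        self.t += 1
        if self.t % self.switch_every == 0:
            self.reward_at = 15 - self.reward_at   # toggle between 1 and 14
        self.state += action
        if self.state == self.reward_at:
            self.state = 7          # rewarded: return to the center
            return 1.0
        if self.state in (1, 14):   # unrewarded terminal: move one position back
            self.state = 2 if self.state == 1 else 13
        return 0.0
```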
Figure 3
Action value, prediction, and state value functions of the two modules as the model learns in two versions of the environment. The reward location alternates every 2500 time steps. Columns represent different locations. States (locations) 2 through 13 are shown; terminal locations are omitted. (A) Action value (Q) as a function of state and time. Top two panels represent the action-value function of module A and bottom two panels represent that of module B. Left panels show the values for leftward movements and right panels show the values for rightward movements. After training, module A selectively prefers rightward movements and module B selectively prefers leftward movements. (B) Prediction and state value functions of each module. Top two panels represent the prediction (P) of each module. Bottom two panels represent the state value function (V) of each module. Left panels show the functions for module A and right panels show the functions for module B. After training, module A predicts that the reward is likely to be obtained in the rightmost position and assigns the rightmost position the highest state value, while module B makes these assignments for the leftmost position.
Figure 4
Module responsibility, module selection, and preferred location follow changes in environment. (A) Responsibility signals of module B (red) and module A (blue) as functions of time. (B) Difference of responsibility signals, λB − λA, (green line) plotted with the changing environment (Env. A or B; red). Positive differences imply greater module B responsibility, whereas negative differences imply greater module A responsibility. (C) Selected module (blue) and environment (red) as functions of time. In environment A (Env. A), reward is located at s = 1. In environment B (Env. B), reward is located at s = 14. Modules switch rapidly at first and then follow changes in environment. (D) Location of the agent as a function of time late in training, from time 27350 to time 27650 (green line). Blue line indicates location smoothed with a moving average with window of width 100. Red circles indicate the times and locations at which the agent obtained the reward. Environment changes from Env. B to Env. A at time 27500 (dashed line). The module switches from B to A around 27530. (E) Location of the agent as a function of time for the entire training period. Symbols are as in (D). After learning, the agent can obtain rewards in either terminal, depending on the environment. (F) Failure of learning of normal, non-modular RL. In this case, the model learns to obtain rewards only at s = 14.
Figure 5
Schematic diagram of cortico-basal ganglia-thalamo-cortical network model. Red arrows and “(+)” indicate excitatory glutamatergic projections, blue arrows and “(−)” indicate inhibitory GABAergic projections, burgundy arrows indicate modulatory connections, i.e., responsibility signals and dopamine signals. Responsibility signals could potentially be conveyed from striosomes by cholinergic interneurons (ACh), and could modulate dopamine signals (DA) reaching D1 and D2 medium spiny neurons (MSNs) from the SNc/VTA (not explicitly included in our computational model). In addition to its input from the thalamus, the output region of the neocortex receives inputs from the input region of the neocortex and has self-feedback connections. “a” and “b” represent actions and action-related signals (e.g., action-value or action-selection signals), and “A” and “B” represent modules. “D1” and “D2” represent direct-pathway, D1-expressing matrix MSNs and their projections and indirect-pathway, D2-expressing matrix MSNs and their projections, respectively. In the model striatum, matrix MSNs can be in either module, express D1 or D2 dopamine receptors, and represent multiple action values. The evaluation cortex (also not explicitly included in our computational model) is assumed to send signals related to responsibility to striosomes. The responsibility signals then influence action-value representations of matrix MSNs. Both direct and indirect pathways in the model are topographical and convergent at a fine level corresponding to action-value representations. GPe, globus pallidus external segment; STN, subthalamic nucleus; GPi/SNr, globus pallidus internal segment/substantia nigra pars reticulata; SNc/VTA, substantia nigra pars compacta/ventral tegmental area; D1 and D2, D1 and D2 MSNs; ACh, acetylcholine; DA, dopamine.
Figure 6
Influences of input cortex and responsibility signaling on striatal matrix MSN activity in the network model. We model two neurons of the input region of the cortex, each representing information related to the value of action “a” or “b.” These neurons project, respectively, to a D1 and a D2 matrix MSN in each of two modules in the striatum. (A) Cortical and striatal activity (arbitrary units) for simulation using only positive responsibility signals. The effect of positive responsibility signals is labeled “↑ DA,” based on the possibility that they may involve an increase in phasic local striatal dopamine release (triggered by a decrease in local acetylcholine release). Responsibility is assigned to module B at time 50 and module A at time 200 (yellow boxes). Such responsibility signaling transiently increases D1 MSN activity and decreases D2 MSN activity. (B) Cortical and striatal activity for simulation using both positive and negative responsibility signals. The effect of negative responsibility signals is labeled “↑ ACh,” based on the possibility that they may involve an increase in local striatal acetylcholine release. Positive responsibility is again assigned to module B at time 50 and module A at time 200 (yellow boxes). Additionally, negative responsibility is assigned to module A at time 50 and module B at time 200 (gray boxes). Negative responsibility signaling transiently increases D2 MSN activity.
Figure 7
Neuronal activity in structures of the cortico-basal ganglia-thalamo-cortical network model. (A) Firing frequency of D1 MSNs in the striatum (n = 300). Color scale indicates firing frequency, x-axis indicates neuron index, and y-axis indicates time in arbitrary units. MSNs on the left (from x = 1 to 150) are in module A and MSNs on the right (from x = 151 to 300) are in module B. Neuron “a” in the input region of the cortex (Figure 6A, left) projects to MSNs 54 and 205, and neuron “b” projects to MSNs 99 and 249. (B) Firing frequency of D2 MSNs (n = 300), which receive exactly the same pattern of connections from the input cortex as do D1 MSNs. (C) Firing frequency of GPe neurons (n = 100). Adjacent GPe neurons receive overlapping convergent inhibitory input from adjacent striatal D2 MSNs. As a result of this overlapping convergent connectivity, the focal striatal activity causes less focal GPe inhibition (i.e., the inhibition is spread or “blurred” over adjacent GPe neurons; blue troughs). (D) Firing frequency of GPi/SNr neurons (n = 50), which receive convergent inhibitory input from striatal D1 MSNs (blue troughs) and excitatory input from STN (red peaks). (E) Firing frequency of STN neurons (n = 25), which receive convergent inhibitory input from GPe (red peaks represent lowest inhibition). (F) Firing frequency of thalamic neurons (n = 200) in the simulation using only positive responsibility signals. (G) Membrane potential of neurons in the output region of the cortex (n = 200) in the simulation using only positive responsibility signals. Vertical red bars represent persistent supra-threshold cortical depolarization maintained by self-feedback connections. (H) Firing frequency of thalamic neurons (n = 200) in the simulation using both positive and negative responsibility signals. (I) Membrane potential of neurons in the output region of the cortex (n = 200) in the simulation using both positive and negative responsibility signals. 
The blue troughs observed in the thalamic and cortical activity are deeper in the simulation using both positive and negative responsibility signals. Note: in (A,B), we show only about one out of every six of the inactive MSNs, to make the active MSNs more visible in the figure.
Figure 8
Schematic summary of proposed effects of striatal responsibility signals on module and action selection. Top: If the sets of actions influenced by module A are appropriate to the environmental context, its striosome (S) assigns high responsibility by sending a signal to adjacent matrisomes (M) via local circuit interneurons. This results in relatively low activity in the indirect pathway and high activity in the direct pathway, which permits the direct pathway to promote selection of an action. Bottom: If the sets of actions influenced by module B are inappropriate to the environmental context, its striosome assigns low responsibility. This results in relatively high activity in the indirect pathway, which suppresses the associated set of candidate actions (behavioral module).
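The gating logic summarized above can be sketched as a small function: the indirect pathway suppresses behavioral modules whose responsibility is low, and the direct pathway then promotes an action from the remaining modules' action values. The caption describes this only qualitatively, so the suppression threshold and the winner-take-all selection here are illustrative assumptions.

```python
def gate_and_select(action_values, lam, suppress_below=0.3):
    """Indirect-pathway-style gating sketch.

    action_values: dict mapping module name -> list of action values
                   for the current state.
    lam:           dict mapping module name -> responsibility λ_m.
    Returns (module, action_index) for the promoted action.
    """
    # Indirect pathway: suppress modules whose responsibility is low.
    active = {m: q for m, q in action_values.items() if lam[m] >= suppress_below}
    if not active:                          # fall back to the most responsible module
        best_m = max(lam, key=lam.get)
        active = {best_m: action_values[best_m]}
    # Direct pathway: promote the highest-valued action among active modules.
    best = None
    for m, q in active.items():
        for a, v in enumerate(q):
            if best is None or v > best[2]:
                best = (m, a, v)
    return best[0], best[1]
```

Note how a high-valued action in a suppressed module (as in module B above) never reaches selection, matching the figure's account of the indirect pathway vetoing a whole behavioral module rather than individual actions.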
