Cereb Cortex. 2012 Mar;22(3):509-26.
doi: 10.1093/cercor/bhr114. Epub 2011 Jun 21.

Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: computational analysis

Michael J Frank et al. Cereb Cortex. 2012 Mar.

Abstract

Growing evidence suggests that the prefrontal cortex (PFC) is organized hierarchically, with more anterior regions having increasingly abstract representations. How does this organization support hierarchical cognitive control and the rapid discovery of abstract action rules? We present computational models at different levels of description. A neural circuit model simulates interacting corticostriatal circuits organized hierarchically. In each circuit, the basal ganglia gate frontal actions, with some striatal units gating the inputs to PFC and others gating the outputs to influence response selection. Learning at all of these levels is accomplished via dopaminergic reward prediction error signals in each corticostriatal circuit. This functionality allows the system to exhibit conditional if-then hypothesis testing and to learn rapidly in environments with hierarchical structure. We also develop a hybrid Bayesian-reinforcement learning mixture of experts (MoE) model, which can estimate the most likely hypothesis state of individual participants based on their observed sequence of choices and rewards. This model yields accurate probabilistic estimates about which hypotheses are attended by manipulating attentional states in the generative neural model and recovering them with the MoE model. This 2-pronged modeling approach leads to multiple quantitative predictions that are tested with functional magnetic resonance imaging in the companion paper.
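The abstract attributes learning at every level to reward prediction error signals. As a minimal sketch of that idea only, the snippet below applies a generic delta-rule update to a single reward estimate. The function name, the learning rate, and the reward sequence are illustrative assumptions and are not taken from the neural circuit model described in the paper.

```python
def update_value(value, reward, learning_rate=0.1):
    """Generic delta-rule update (an illustrative assumption, not the paper's model).

    Returns the updated reward estimate and the prediction error that drove it.
    """
    # The prediction error is positive when the outcome is better than expected
    # and negative when it is worse.
    prediction_error = reward - value
    new_value = value + learning_rate * prediction_error
    return new_value, prediction_error


# Hypothetical sequence of binary rewards for one stimulus-response pair.
value = 0.5
for reward in [1, 1, 0, 1]:
    value, delta = update_value(value, reward)
```

With a small learning rate the estimate moves gradually toward the average reward, which is why a single unexpected outcome changes it only slightly.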

Figures

Figure 1.
Badre et al. (2010) hierarchical reinforcement learning task. A schematic depiction of trial events along with example stimulus-to-response mappings for hierarchical and flat rule sets. (a) Trials began with presentation of a stimulus followed by a green fixation cross. Participants could respond with a button press at any time while the stimulus or green fixation cross was present. After a variable delay following the response, participants received auditory feedback indicating whether the response they had chosen was correct given the presented stimulus. Trials were separated by a variable null interval. (b) Example stimulus-to-response mappings for the Flat set. The arrangement of mappings for the Flat set was such that no higher order relationship was present; thus, each rule had to be learned individually. (c) Example stimulus-to-response mappings for the Hierarchical set. Response mappings in this example are grouped such that in the presence of a red square, only shape determines the response, while in the presence of a blue square, only orientation determines the response.
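The hierarchical mapping in panel (c) can be sketched as a conditional lookup in which color selects the dimension that determines the response. The feature labels, response numbers, and function name below are hypothetical placeholders; only the red-uses-shape and blue-uses-orientation structure comes from the caption.

```python
# Hypothetical feature labels and response keys, chosen for illustration only.
SHAPE_TO_RESPONSE = {"shape1": 1, "shape2": 2, "shape3": 3}
ORIENTATION_TO_RESPONSE = {"orient1": 1, "orient2": 2, "orient3": 3}


def hierarchical_response(color, shape, orientation):
    """Return the response under a hierarchical rule set.

    Color acts as the higher order context: it selects which lower order
    dimension (shape or orientation) determines the response.
    """
    if color == "red":
        return SHAPE_TO_RESPONSE[shape]          # orientation is ignored
    if color == "blue":
        return ORIENTATION_TO_RESPONSE[orientation]  # shape is ignored
    raise ValueError(f"unknown color: {color!r}")


# Under a red context, changing the orientation does not change the response.
same_response = (
    hierarchical_response("red", "shape2", "orient1")
    == hierarchical_response("red", "shape2", "orient3")
)
```

A flat rule set has no such shortcut: each conjunction of color, shape, and orientation would need its own entry in a lookup table.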
Figure 2.
Schematic of hierarchical corticostriatal circuit. In the standard response selection circuit, motor areas of the striatum interact with motor cortex to facilitate response selection based on the learned probability of reward given the current stimulus state. The PMDMaint layer represents possible stimuli to be actively maintained so as to constrain motor selection processes. Its corresponding striatal region learns which stimulus dimensions should be gated into PMd based on the learned probability that their maintenance is predictive of reward. The PMDOut layer represents the deep lamina (e.g., layers 5/6) of PMd in which only a subset of currently maintained PMd stimuli influences response selection, by projecting to the motor striatum. Its corresponding striatal area learns which of the maintained PMd stimuli should be output-gated depending on context. The most anterior prePMd layer maintains stimulus features that act as context, by sending their axons to striatal output-gating areas of PMd. Its corresponding striatal gating layer learns whether the maintenance of particular stimuli as higher order context in prePMd is predictive of reward. Bottom: example network state when presented with color 2, shape 3, and orientation 2. Arrows reflect direct projections, circles reflect BG gating circuitry, and dashed red lines reflect hierarchical flow of control. S3 and O2 are maintained in PMd, and C2 in prePMd. Due to influences of C2 in prePMd, only the shape and not orientation is output-gated. The number of stimulus–response associations is reduced by focusing on PMDOut states.
Figure 3.
Mixture of experts model. Each flat expert learns reward probabilities for each response given their expert dimension(s) (O, orientation; S, shape; C, color). Responses are selected by each expert using the softmax logistic function. An overall Flat expert learns via a reinforcement credit assignment to allocate attention among the experts in proportion to their reliability. Each hierarchical expert learns to dynamically gate attention to one of 2 dimensions depending on a candidate higher order dimension. The leftmost expert learns to attend to orientation for red contexts and to shape for blue contexts. The overall Hierarchical expert learns which of the hierarchical experts is most reliable, and the overall motor response is selected as a mixture between the 2 top-level experts (representing whether the task structure is likely to be hierarchical or flat), again in proportion to their reliabilities.
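The two steps named in this caption, softmax response selection within each expert and a reliability-weighted mixture across experts, can be sketched as follows. This is a simplified illustration, not the authors' MoE implementation: the function names, the temperature parameter, and the normalization of attention weights by their sum are assumptions.

```python
import math


def softmax(values, temperature=1.0):
    """Convert learned response values into choice probabilities."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_policy(expert_values, attention_weights, temperature=1.0):
    """Mix the experts' softmax policies in proportion to their weights.

    expert_values: one list of response values per expert (equal lengths).
    attention_weights: one non-negative weight per expert (assumed form).
    """
    total_weight = sum(attention_weights)
    n_responses = len(expert_values[0])
    mixed = [0.0] * n_responses
    for values, weight in zip(expert_values, attention_weights):
        probs = softmax(values, temperature)
        for i, p in enumerate(probs):
            mixed[i] += (weight / total_weight) * p
    return mixed


# Hypothetical values: the first expert prefers response 0, the second is indifferent.
policy = mixture_policy([[2.0, 0.0], [0.0, 0.0]], [1.0, 1.0])
```

Because each expert's policy sums to 1 and the weights are normalized, the mixed policy is also a valid probability distribution over responses.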
Figure 4.
(a) Corticostriatal circuit network performance in Badre et al. hierarchical learning task, as a function of trials. Learning is enhanced in hierarchical networks relative to networks with no hierarchical structure (no modulation of PMd circuit by prePMd, “nohier”) and relative to networks with hierarchical structure but no dopamine modulation of learning in striatal gating units (hier_noDAmod). (b) Activity levels in model prePMd in hierarchical and flat conditions. Results in both panels are averaged across 25 different networks with random initial synaptic weights. Error bars reflect standard error of the mean.
Figure 5.
(a, b) Left: striatal output-gating units from a nonhierarchical network have to learn to output-gate each individual shape feature by assigning strong weights from each of the PMd shape units to distinct patterns of Go units. Right: In the hierarchical network, output-gating units of the shape stripe can learn strong Go gating associations whenever the red color unit is active in prePMd. This allows the network to generalize across shapes without learning about each one. (c) An index of this hierarchical gating policy abstraction was computed as a function of the weights from prePMd to striatal output-gating units in the PMd circuit that support gating of the hierarchical rule (see text). As networks learned in the hierarchical block, the striatum developed an abstract gating policy (e.g., one that gates all shapes for a given color, regardless of the particular shape feature), whereas gating weights for the opposite rule declined. (d) Across 25 hierarchical networks, the degree of gating policy abstraction at the end of the block was tightly correlated with terminal accuracy.
Figure 6.
MoE model fits to behavior in Hierarchical and Flat conditions. The graph shows the relationship between the model’s predicted probability that any given response is selected in a given trial (in bins of width 0.1) and the actual proportion of trials in which the associated response was selected by participants in each bin. Results are shown across all participants, where each participant’s model was optimized by maximizing the likelihood of their trial-by-trial sequence of responses. There was a strong correlation between model predictions and actual choices in both Hierarchical and Flat conditions (r = 0.99 in both cases). The numerically reduced proportion of actual choices in the highest bin in the Flat condition was associated with a small number of samples for which model predictions were >0.9.
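The binning described in this caption can be sketched as below: predicted choice probabilities are grouped into bins of width 0.1 and, within each bin, the proportion of trials on which the response was actually chosen is computed. The function name and the example data are hypothetical; how the authors handled bin edges and empty bins is not stated, so those choices here are assumptions.

```python
def calibration_bins(predicted, chosen, n_bins=10):
    """Compare predicted choice probabilities with observed choice proportions.

    predicted: model probability that a given response is selected (0 to 1).
    chosen: True if that response was actually selected on that trial.
    Returns (observed proportion per bin, sample count per bin); bins with
    no samples get None (an assumed convention).
    """
    counts = [0] * n_bins
    hits = [0] * n_bins
    for p, was_chosen in zip(predicted, chosen):
        # A probability of exactly 1.0 is placed in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
        if was_chosen:
            hits[idx] += 1
    proportions = [hits[i] / counts[i] if counts[i] else None for i in range(n_bins)]
    return proportions, counts


# Hypothetical predictions and choices for four response options.
proportions, counts = calibration_bins(
    [0.05, 0.95, 0.92, 0.55], [False, True, False, True]
)
```

Returning the per-bin counts alongside the proportions makes sparsely populated bins visible, which matters for the caption's remark that few samples had predictions above 0.9 in the Flat condition.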
Figure 7.
Top: Example attentional weights in a single participant in the hierarchical and flat conditions, as estimated by best fitting model parameters to their trial-by-trial sequences of choices. Within the hierarchical expert, evidence for the correct Hier(O,S|C) expert increases relatively early on, but the overall attention to Hierarchy relative to Flat (dashed red line, wH) does not substantially increase until after trial 200. This participant performed the flat condition second and begins with a prior to attend to hierarchy, but when the evidence does not support it, the weight to hierarchy decreases while the weight to the eventual winning full conjunctive expert (black asterisk) increases. Green “lcurve” lines reflect smoothed behavioral learning curves as estimated from a Bayesian state space model, which estimates the probability of a correct response at each trial (Smith et al. 2004). Bottom: attentional weights to overall hierarchical versus flat expert for all participants. Some participants show rapid increases in attention to hierarchy, whereas others show delayed and/or mixed attention to hierarchy.
Figure 8.
(a) Example attentional weights estimated by the MoE fits to the trial-by-trial sequence of choices generated by a BG–PFC network in the hierarchical condition. The smoothed learning curve of this network is plotted on the same scale (dotted green line). This network appeared to begin by relying primarily on unidimensional strategies (particularly orientation; black triangles), whose weights then decrease with experience due to their inconsistent reward associations. The weights to the full 3-way conjunctive expert (black asterisks) increase incrementally as performance improves. Within the hierarchical expert (blue curves), evidence for the correct Hier(O,S|C) expert (blue circles) increases relatively early on, but the overall attention to Hierarchy relative to Flat (wH, dashed red line) does not substantially increase until after trial 200. See Figure 7 for full legend indicating the identity of each expert. (b) Mean (±standard error) attentional weights to hierarchical versus flat expert (wH) estimated across all 25 networks with (red) and without (black) hierarchical structure (projections from prePMd to output-gating units of PMd).

References

    1. Akaike H. A new look at the statistical model identification. IEEE Trans Automat Contr. 1974;19:716–723.
    2. Alexander G, DeLong M, Strick P. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu Rev Neurosci. 1986;9:357–381.
    3. Badre D. Cognitive control, hierarchy, and the rostro-caudal organization of the frontal lobes. Trends Cogn Sci. 2008;12:193–200.
    4. Badre D, D’Esposito M. Functional magnetic resonance imaging evidence for a hierarchical organization of the prefrontal cortex. J Cogn Neurosci. 2007;19:2082–2099.
    5. Badre D, Hoffman J, Cooney J, D’Esposito M. Hierarchical cognitive control deficits following damage to the human frontal lobe. Nat Neurosci. 2009;12:515–522.
