Neuron. 2011 Jul 28;71(2):370-379.
doi: 10.1016/j.neuron.2011.05.042.

A neural signature of hierarchical reinforcement learning

José J F Ribas-Fernandes et al. Neuron, 2011.

Abstract

Human behavior displays hierarchical structure: simple actions cohere into subtask sequences, which work together to accomplish overall task goals. Although the neural substrates of such hierarchy have been the target of increasing research, they remain poorly understood. We propose that the computations supporting hierarchical behavior may relate to those in hierarchical reinforcement learning (HRL), a machine-learning framework that extends reinforcement-learning mechanisms into hierarchical domains. To test this, we leveraged a distinctive prediction arising from HRL. In ordinary reinforcement learning, reward prediction errors are computed when there is an unanticipated change in the prospects for accomplishing overall task goals. HRL entails that prediction errors should also occur in relation to task subgoals. In three neuroimaging studies we observed neural responses consistent with such subgoal-related reward prediction errors, within structures previously implicated in reinforcement learning. The results reported support the relevance of HRL to the neural processes underlying hierarchical behavior.

Figures

Figure 1
Illustration of HRL dynamics. At t1, a primitive action (a) is selected. Based on the consequent state, an RPE (reward prediction error) is computed (green arrow from t2 to t1) and used to update the action policy (π) for the preceding state, as well as the value (V) of that state (an estimate of the expected future reward when starting from that state). At t2 a subroutine (σ) is selected and remains active through t5. Until then, primitive actions are selected as dictated by σ (lower tier). A PPE (pseudo-reward prediction error) is computed after each action (lower green arrows from t5 to t2) and used to update the subroutine-specific action policy (πσ) and state values (Vσ). These PPEs are computed with respect to the pseudo-reward received at the end of the subroutine (yellow asterisk). Once the subgoal state of σ is reached, σ terminates. An RPE is computed for the entire subroutine (upper green arrow from t5 to t2) and used to update the value and policy, V and π, associated with the state in which σ was initiated. A new action is then selected at the top level, yielding primary reward (red asterisk). Adapted from Botvinick et al. (2009).
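The caption above walks through the two tiers of updates. Below is a minimal, self-contained Python sketch of that scheme, assuming a toy one-dimensional corridor and a trivial subroutine policy; the identifiers (Subroutine, td_update), the learning rate, and the discount factor are illustrative choices, not the paper's simulation code.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95  # illustrative learning rate and discount factor

class Subroutine:
    """A subroutine (sigma) with its own value table V_sigma and a subgoal state."""
    def __init__(self, subgoal, pseudo_reward=1.0):
        self.subgoal = subgoal
        self.pseudo_reward = pseudo_reward   # the yellow asterisk in Figure 1
        self.V = defaultdict(float)          # V_sigma: subroutine-specific values

    def td_update(self, s, s_next):
        # PPE: a TD error computed against pseudo-reward, not primary reward.
        r = self.pseudo_reward if s_next == self.subgoal else 0.0
        ppe = r + GAMMA * self.V[s_next] - self.V[s]
        self.V[s] += ALPHA * ppe
        return ppe

def execute(sigma, s):
    """Run sigma until its subgoal is reached; return final state and step count."""
    tau = 0
    while s != sigma.subgoal:
        s_next = s + 1                       # primitive action chosen under pi_sigma
        sigma.td_update(s, s_next)           # the lower green arrows in Figure 1
        s, tau = s_next, tau + 1
    return s, tau

V = defaultdict(float)                       # top-level state values
sigma = Subroutine(subgoal=5)
s0 = 0
s_term, tau = execute(sigma, s0)

# RPE for the entire subroutine (the upper green arrow): an SMDP-style update
# in which the outcome is discounted by the tau steps the subroutine consumed.
r_along_the_way = 0.0                        # no primary reward before the goal
rpe = r_along_the_way + GAMMA**tau * V[s_term] - V[s0]
V[s0] += ALPHA * rpe
print(f"subroutine took {tau} steps; top-level RPE at its initiation state: {rpe:.3f}")
```

The key structural point the sketch illustrates: the PPE is computed against pseudo-reward inside the subroutine, while the RPE treats the whole subroutine as a single temporally extended step.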
Figure 2
Task and predictions from HRL and RL. Left: Task display and underlying geometry of the delivery task. Right: Prediction-error signals generated by standard RL and by HRL for each category of jump event. Grey bars mark the time-step immediately preceding a jump event. Dashed time-courses indicate the PPE generated by C and D jumps that change the distance to the subgoal by a smaller amount. For simulation methods, see Experimental Procedures.
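To make the Figure 2 predictions concrete, here is a back-of-the-envelope sketch under the simplifying assumption (ours, not necessarily the paper's) that state values fall off linearly with remaining distance, so each prediction error is just the jump-induced change in the relevant value:

```python
def prediction_errors(d_sub_before, d_sub_after, d_goal_before, d_goal_after, k=0.1):
    """PE signals at a jump, assuming V is proportional to -k * distance."""
    ppe = -k * (d_sub_after - d_sub_before)    # subgoal-related: predicted by HRL only
    rpe = -k * (d_goal_after - d_goal_before)  # goal-related: predicted by RL and HRL
    return ppe, rpe

# The diagnostic condition: a jump that moves the subgoal farther away while
# leaving the overall distance to the goal unchanged. HRL predicts a negative
# PPE here, whereas standard RL predicts no prediction error at all.
ppe, rpe = prediction_errors(d_sub_before=4, d_sub_after=7,
                             d_goal_before=10, d_goal_after=10)
print(f"PPE = {ppe:+.2f}, RPE = {rpe:+.2f}")   # PPE = -0.30, RPE = +0.00
```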
Figure 3
Results of EEG experiment. Left: Evoked potentials at electrode Cz, aligned to jump events and averaged across participants. D and E refer to the jump destinations in Figure 2. The data series labeled D−E shows the difference between curves D and E, isolating the PPE effect. Right: Scalp topography for condition D with baseline condition E subtracted (topography plotted on the same grid used in Yeung, Holroyd, and Cohen, 2005).
Figure 4
Results of fMRI Experiment 1. Shown are regions displaying a positive correlation with the PPE, independent of subgoal displacement. Talairach coordinates of the peaks are (0, 9, 39) for dorsal anterior cingulate cortex and (45, 12, 0) for right anterior insula. Not shown are foci in left anterior insula (−45, 9, −3) and lingual gyrus (0, −66, 0). Color indicates general linear model parameter estimates, ranging from 3.0 × 10⁻⁴ (palest yellow) to 1.2 × 10⁻³ (darkest orange).
Figure 5
Results of behavioral experiment. Left: Example of a choice display. Subgoal 1 always lay on an ellipse defined by the house and the truck. In this example, subgoal 2 has a smaller overall distance to the goal and a larger distance to the truck relative to subgoal 1 (labels not shown to participants). Right: Results of the logistic regression on choices and of the comparison between two RL models (a worked sketch follows this caption). Choices were significantly driven by the ratio of the two subgoals' overall distances to the goal (left box; the central mark is the median, box edges are the 25th and 75th percentiles, whiskers extend to the most extreme non-outlier values, outliers are plotted as individual dots beyond the whiskers, and each colored dot represents a single participant's data), whereas the ratio of distances to the truck did not significantly explain participants' choices (middle box). Bayes factors favored the model with reward only for goal attainment over the model with reward for both subgoal and goal attainment (right box).
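The following is a sketch of the kind of logistic-regression analysis the caption describes, run on simulated choices; the predictor names, the simulated decision rule, and the effect size are illustrative assumptions rather than the participants' data, and the Bayes-factor model comparison is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Per-trial predictors: log-ratios (subgoal 2 vs. subgoal 1) of overall
# distance to the goal and of distance to the truck.
log_goal_ratio = rng.normal(size=n)
log_truck_ratio = rng.normal(size=n)

# Simulate choosers who, like the participants, weigh only overall distance
# to the goal (the coefficient 3.0 is an arbitrary illustrative effect size).
p_choose_1 = 1.0 / (1.0 + np.exp(-3.0 * log_goal_ratio))
chose_1 = rng.random(n) < p_choose_1

X = np.column_stack([log_goal_ratio, log_truck_ratio])
fit = LogisticRegression().fit(X, chose_1)
print("coefficients (goal ratio, truck ratio):", fit.coef_[0])
# Expect a large weight on the goal-distance ratio and a weight near zero on
# the truck-distance ratio, mirroring the left and middle boxes in Figure 5.
```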

References

    1. Badre D. Cognitive control, hierarchy, and the rostro-caudal organization of the frontal lobes. Trends Cogn. Sci. 2008;12:193–200.
    2. Badre D, Hoffman J, Cooney J, D'Esposito M. Hierarchical cognitive control deficits following damage to the human frontal lobe. Nat. Neurosci. 2009;12:515–522.
    3. Baker TE, Holroyd CB. Dissociated roles of the anterior cingulate cortex in reward and conflict processing as revealed by the feedback error-related negativity and N200. Biol. Psychol. In press.
    4. Barto A, Mahadevan S. Recent advances in hierarchical reinforcement learning. Disc. Event Dyn. Sys. 2003;13:341–379.
    5. Barto AG. Adaptive critics and the basal ganglia. In: Houk JC, Davis J, Beiser D, editors. Models of Information Processing in the Basal Ganglia. Cambridge, MA: MIT Press; 1995. pp. 215–232.
