Review. Cognition. 2009 Dec;113(3):262-280. doi: 10.1016/j.cognition.2008.08.011. Epub 2008 Oct 15.

Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective


Matthew M Botvinick et al. Cognition. 2009 Dec.

Abstract

Research on human and animal behavior has long emphasized its hierarchical structure: the divisibility of ongoing behavior into discrete tasks, which are composed of subtask sequences, which in turn are built of simple actions. The hierarchical structure of behavior has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this paper, we reexamine behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behavior. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behavior.
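
The "reusable subroutines or skills" referred to here are options in the sense of Sutton et al. (1999), cited with Figure 4 below. As a rough sketch (field names are illustrative, not the paper's notation), an option bundles an initiation set, an option-specific policy, and a termination condition, and can then be selected wherever a primitive action could be:

from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    initiation_set: Set[Any]            # states in which the option may be selected
    policy: Callable[[Any], Any]        # option-specific mapping from state to primitive action
    terminates: Callable[[Any], bool]   # True when the option ends, e.g. at its subgoal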


Figures

Figure 1
An illustration of how options can facilitate search. (A) A search tree with arrows indicating the pathway to a goal state. A specific sequence of seven independently selected actions is required to reach the goal. (B) The same tree and trajectory, the colors indicating that the first four and the last three actions have been aggregated into options. Here, the goal state is reached after only two independent choices (selection of the options). (C) Illustration of search using option models, which allow the ultimate consequences of an option to be forecast without requiring consideration of the lower-level steps that would be involved in executing the option.
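
A minimal sketch of the forecasting idea in panel C, under assumed names (OptionModel and plan_with_options are hypothetical): each option is summarized by a model of its predicted terminal state and expected reward, so a planner can treat an entire subroutine as a single choice instead of stepping through its primitive actions.

from dataclasses import dataclass

@dataclass
class OptionModel:
    name: str
    terminal_state: str    # state the option is predicted to end in
    reward: float          # expected cumulative reward while the option runs
    duration: int          # expected number of primitive steps (ignored in this undiscounted sketch)

def plan_with_options(start, goal, models, max_choices=2):
    """Breadth-limited search over option outcomes: each expansion is one
    independent choice, however many primitive steps it summarizes."""
    frontier = [(start, [], 0.0)]
    for _ in range(max_choices):
        next_frontier = []
        for state, path, value in frontier:
            for m in models.get(state, []):
                node = (m.terminal_state, path + [m.name], value + m.reward)
                if m.terminal_state == goal:
                    return node
                next_frontier.append(node)
        frontier = next_frontier
    return None

# Two option choices reach a goal that requires seven primitive actions (panels A-B).
models = {"start":   [OptionModel("go-to-hallway", "hallway", -4.0, 4)],
          "hallway": [OptionModel("hallway-to-goal", "goal", -3.0, 3)]}
print(plan_with_options("start", "goal", models))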
Figure 2
An actor-critic implementation. (A) Schematic of the basic actor-critic architecture. R(s): reward function; V(s): value function; δ: temporal difference prediction error; π(s): policy, determined by action strengths W. (B) An actor-critic implementation of HRL. o: currently controlling option; Ro(s): option-dependent reward function; Vo(s): option-specific value functions; δ: temporal difference prediction error; πo(s): option-specific policies, determined by option-specific action/option strengths. (C) Putative neural correlates of the elements diagrammed in panel A. (D) Potential neural correlates of the elements diagrammed in panel B. Abbreviations: DA: dopamine; DLPFC: dorsolateral prefrontal cortex, plus other frontal structures potentially including premotor, supplementary motor, and pre-supplementary motor cortices; DLS: dorsolateral striatum; HT+: hypothalamus and other structures, potentially including the habenula, the pedunculopontine nucleus, and the superior colliculus; OFC: orbitofrontal cortex; VS: ventral striatum.
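
As a concrete reading of panel A, here is a minimal tabular actor-critic sketch (variable names and learning rates are illustrative, not the paper's simulation settings): the critic maintains V(s), the actor maintains action strengths W(s, a) that define a softmax policy π(s), and both are trained by the temporal-difference prediction error δ.

import math, random
from collections import defaultdict

gamma, alpha_v, alpha_w = 0.9, 0.1, 0.1
V = defaultdict(float)                    # critic: state-value estimates V(s)
W = defaultdict(float)                    # actor: action strengths W, keyed by (s, a)

def policy(s, actions):
    """Softmax over action strengths: the pi(s) of the figure."""
    prefs = [math.exp(W[(s, a)]) for a in actions]
    z = sum(prefs)
    r = random.random() * z
    for a, p in zip(actions, prefs):
        r -= p
        if r <= 0:
            return a
    return actions[-1]

def actor_critic_step(s, a, reward, s_next):
    """delta = r + gamma*V(s') - V(s); the same error trains critic and actor."""
    delta = reward + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta               # critic update
    W[(s, a)] += alpha_w * delta          # actor update: strengthen or weaken a in s
    return delta

# One illustrative transition: the action chosen in "s0" leads to "s1" with reward 1.
a = policy("s0", ["left", "right"])
actor_critic_step("s0", a, reward=1.0, s_next="s1")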
Figure 3
A schematic illustration of HRL dynamics. a, primitive actions; o, option. On the first timestep (t = 1), the agent executes a primitive action (forward arrow). Based on the consequent state (i.e., the state at t = 2), a prediction error δ is computed (arrow running from t = 2 to t = 1), and used to update the value (V) and action/option strengths (W) associated with the preceding state. At t = 2, the agent selects an option (long forward arrow), which remains active through t = 5. During this time, primitive actions are selected according to the option’s policy (lower tier of forward arrows). A prediction error is computed after each (lower tier of curved arrows), and used to update the option-specific values (Vo) and action strengths (Wo) associated with the preceding state. These prediction errors, unlike those at the level above, take into account pseudo-reward received throughout the execution of the option (higher asterisk). Once the option’s subgoal state is reached, the option is terminated. A prediction error is computed for the entire option (long curved arrow), and this is used to update the values and option strengths associated with the state in which the option was initiated. The agent then selects a new action at the top level, which yields external reward (lower asterisk). The prediction errors computed at the top level, but not at the level below, take this reward into account.
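
A sketch of these two tiers of updates under assumed interfaces (env.step, option.policy, option.terminates, option.pseudo_reward, and the defaultdict-style tables V_root, W_root, Vo, Wo are hypothetical names, not the paper's code): the option-specific error is driven by pseudo-reward at each primitive step, while a single option-level error based on the accumulated external reward is applied when the option terminates.

def execute_option(env, s, option, V_root, W_root, Vo, Wo, gamma=0.9, alpha=0.1):
    """Run one option to termination, doing intra-option updates along the way
    and one SMDP-style update of the root level at the end."""
    s_init, k, external = s, 0, 0.0
    while not option.terminates(s):
        a = option.policy(s)                                # option-specific policy pi_o(s)
        s_next, ext_reward = env.step(s, a)
        pseudo = option.pseudo_reward(s_next)               # delivered at the subgoal
        # lower-tier prediction error: uses pseudo-reward, not external reward
        delta_o = pseudo + gamma * Vo[(option, s_next)] - Vo[(option, s)]
        Vo[(option, s)] += alpha * delta_o
        Wo[(option, s, a)] += alpha * delta_o
        external += (gamma ** k) * ext_reward               # external reward accrues for the top level
        s, k = s_next, k + 1
    # top-tier prediction error for the whole option, computed at termination
    delta = external + (gamma ** k) * V_root[s] - V_root[s_init]
    V_root[s_init] += alpha * delta
    W_root[(s_init, option)] += alpha * delta
    return s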
Figure 4
(A) The rooms problem, adapted from Sutton et al. (1999). S: start; G: goal. (B) Learning curves for the eight doorway options, plotted over the first 150 occurrences of each (mean over 100 simulation runs). See online appendix for simulation details. (C) The upper left room from panel A, illustrating the policy learned by one doorway option. Arrows indicate the primitive action selected most frequently in each state. SG: option subgoal. Colors indicate the option-specific value for each state. (D) Learning curves indicating solution times, i.e., number of primitive steps to goal, on the problem illustrated in panel A (mean over 100 simulation runs). Upper data series: Performance when only primitive actions were included. Lower series: Performance when both primitive actions and doorway options were included. Policies for doorway options were established through earlier training (see online appendix).
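
A compact sketch of the setup in panel A, with illustrative doorway coordinates and a hypothetical room_of(state) helper (neither is taken from the paper or its online appendix): each room's two doorways supply the subgoals of the eight doorway options, and the root level chooses among primitive moves plus whichever options can be initiated in the current room.

from dataclasses import dataclass

PRIMITIVES = ["up", "down", "left", "right"]

@dataclass(frozen=True)
class DoorwayOption:
    room: str          # room whose cells form the option's initiation set
    subgoal: tuple     # doorway cell where the option terminates with pseudo-reward

# Four rooms x two doorways each = the eight doorway options of panel B.
ROOM_DOORWAYS = {
    "upper-left":  [(5, 2), (2, 5)],
    "upper-right": [(2, 5), (5, 9)],
    "lower-left":  [(5, 2), (9, 5)],
    "lower-right": [(9, 5), (5, 9)],
}
DOORWAY_OPTIONS = [DoorwayOption(room, door)
                   for room, doors in ROOM_DOORWAYS.items() for door in doors]

def root_choices(state, room_of, options=DOORWAY_OPTIONS):
    """Root-level action set: primitive moves plus any doorway option
    that can be initiated in the agent's current room."""
    return PRIMITIVES + [o for o in options if o.room == room_of(state)]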
Figure 5
(A) The rooms problem from Figure 4, with ‘windows’ (w) defining option subgoals. (B) Learning curves for the problem illustrated in panel A. Lower data series: steps to goal over episodes with only primitive actions included (mean values over 100 simulation runs). Upper series: performance with both primitive actions and window options included. (C) Illustration of performance when a ‘shortcut’ is opened up between the upper right and lower left rooms (center tile). Lower trajectory: path to goal most frequently taken after learning with only primitive actions included. Upper trajectory: path most frequently taken after learning with both primitive actions and doorway options. Black arrows indicate primitive actions selected by the root policy. Other arrows indicate primitive actions selected by two doorway options.
Figure 6
Illustration of the role of the prefrontal cortex, as postulated by guided activation theory (Miller & Cohen, 2001). Patterns of activation in prefrontal cortex (filled elements in the boxed region) effectively select among stimulus-response pathways lying elsewhere in the brain (lower area). Here, representations within prefrontal cortex correspond to option identifiers in HRL, while the stimulus-response pathways selected correspond to option-specific policies. Figure adapted from Miller and Cohen (2001, permission pending).
