Neural correlates of forward planning in a spatial decision task in humans

Dylan Alexander Simon et al. J Neurosci. 2011 Apr 6;31(14):5526-39. doi: 10.1523/JNEUROSCI.4647-10.2011.
Abstract

Although reinforcement learning (RL) theories have been influential in characterizing the mechanisms for reward-guided choice in the brain, the predominant temporal difference (TD) algorithm cannot explain many flexible or goal-directed actions that have been demonstrated behaviorally. We investigate such actions by contrasting an RL algorithm that is model based, in that it relies on learning a map or model of the task and planning within it, to traditional model-free TD learning. To distinguish these approaches in humans, we used functional magnetic resonance imaging in a continuous spatial navigation task, in which frequent changes to the layout of the maze forced subjects continually to relearn their favored routes, thereby exposing the RL mechanisms used. We sought evidence for the neural substrates of such mechanisms by comparing choice behavior and blood oxygen level-dependent (BOLD) signals to decision variables extracted from simulations of either algorithm. Both choices and value-related BOLD signals in striatum, although most often associated with TD learning, were better explained by the model-based theory. Furthermore, predecessor quantities for the model-based value computation were correlated with BOLD signals in the medial temporal lobe and frontal cortex. These results point to a significant extension of both the computational and anatomical substrates for RL in the brain.
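The distinction the abstract draws, between model-free TD learning (updating values from sampled prediction errors) and model-based RL (planning over a learned map of the task), can be illustrated with a minimal sketch. All names and parameters below are illustrative, not from the paper: a three-state deterministic chain stands in for the maze.

```python
# Toy 3-state chain: s0 -> s1 -> s2 (terminal), reward 1.0 on entering s2.
# Illustrative only; the study used a continuous spatial maze, not this chain.
N_STATES = 3
REWARD = {2: 1.0}   # reward received on entering a state
GAMMA = 0.9         # discount factor

def step(s):
    """Deterministic transition: move one state to the right."""
    s_next = min(s + 1, N_STATES - 1)
    return s_next, REWARD.get(s_next, 0.0)

def td_learning(episodes=500, alpha=0.1):
    """Model-free TD(0): learn V(s) from sampled transitions only."""
    V = [0.0] * N_STATES
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            s_next, r = step(s)
            V[s] += alpha * (r + GAMMA * V[s_next] - V[s])  # TD-error update
            s = s_next
    return V

def model_based_values(sweeps=50):
    """Model-based evaluation: iterate Bellman backups on the known model."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            s_next, r = step(s)            # query the model directly
            V[s] = r + GAMMA * V[s_next]   # full backup, no sampling needed
    return V

print(td_learning())        # converges toward [0.9, 1.0, 0.0]
print(model_based_values()) # [0.9, 1.0, 0.0] after planning
```

The behavioral signature the task exploits follows from this contrast: if the model (e.g., the maze layout) changes, the planner revalues states immediately by re-running its backups, whereas the TD learner must re-experience transitions before its cached values catch up.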


Figures

Figure 1.
Task flow and example state. A, Subjects were cued to choose a direction by pressing a key. If the subject did not respond within 2 s, she lost a turn and was again presented with the same choice (no movement). Otherwise, an animation was shown moving to the room in the selected direction (or to a random room for randomly occurring jumps); this movement lasted 1.5–2 s, jittered uniformly. Then, the next room was presented, including the available transitions from that room and any received reward. Finally, after 0.5 s, the subject was cued to make the next decision. Only the doors in the current room were visible to the subject. B, A possible abstract layout of the task, where each square represents a room, and each arrow represents an available door direction the subject may choose from. The circles represent reward locations, where the subject would gain the indicated reward value each time the room was visited. At each step, each one-way door could flip direction independently with probability 1/24.
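The transition dynamics described in this caption (each one-way door independently reversing direction with probability 1/24 per step) can be sketched as follows. The door identifiers and direction encoding are illustrative assumptions, not taken from the paper.

```python
import random

FLIP_PROB = 1 / 24  # per-door, per-step flip probability from the caption

def flip_doors(doors, rng=random):
    """Return a new door map in which each one-way door has independently
    reversed direction with probability FLIP_PROB.
    `doors` maps a door id to its current direction ('N', 'S', 'E', 'W')."""
    OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
    return {d: (OPPOSITE[v] if rng.random() < FLIP_PROB else v)
            for d, v in doors.items()}

# Example: most steps leave the layout unchanged, since 1/24 is small.
doors = {"door_a": "N", "door_b": "E"}
print(flip_doors(doors))
```

Because the flip probability is low, subjects' learned routes stay useful for stretches of trials but are occasionally invalidated, which is what forces the continual relearning the task is designed to expose.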
Figure 2.
Behavioral model likelihood comparison. Negative log-likelihood evidence values under BIC. Shown are per-subject log Bayes factors comparing planning against TD.
Figure 3.
Value-responsive areas. A, B, T statistic map of group response size to planned (A) and TD-based (B) value predictions from separate models (shown at p < 0.001, uncorrected; significant p < 0.05 FDR clusters highlighted).
Figure 4.
Identification of value-related voxels of interest. T statistic map of group response size to either planned or TD-based value predictions (summed contrast, shown at p < 0.001, uncorrected; significance not assessed). The most responsive peak voxels of this map lying anatomically within striatum were identified for additional analysis.
Figure 5.
Striatal BOLD responses to partial value components. Responses (mean effect sizes, arbitrary units) to key components of the value predictions as predicted by the two algorithms in the previously identified VOIs. Also shown are the predicted responses from the overall value fit assuming exponential discounting and updating. Note that significances, as indicated by *p < 0.05 and **p < 0.01, are biased by voxel selection.
Figure 6.
Responses to predicted next-step rewards beyond chosen values. T statistic map of responsive regions to choices that are expected to lead to a reward room (r1), greater than the first two terms of the value equation (r1 + γr2; shown at p < 0.001, uncorrected; significant p < 0.05 FDR clusters highlighted).
Figure 7.
Response to both one-step predicted and immediate choice count. Masked T statistic map of responses to expected next-step choice set size within regions responsive to current choice set size (all n0 significant p < 0.05 FDR cluster size; n1 shown at p < 0.001, uncorrected; two-tailed).

