2016 Nov 8;113(45):12868-12873.
doi: 10.1073/pnas.1609094113. Epub 2016 Oct 24.

Adaptive integration of habits into depth-limited planning defines a habitual-goal-directed spectrum


Mehdi Keramati et al. Proc Natl Acad Sci U S A.

Abstract

Behavioral and neural evidence reveal a prospective goal-directed decision process that relies on mental simulation of the environment, and a retrospective habitual process that caches returns previously garnered from available choices. Artificial systems combine the two by simulating the environment up to some depth and then exploiting habitual values as proxies for consequences that may arise in the further future. Using a three-step task, we provide evidence that human subjects use such a normative plan-until-habit strategy, implying a spectrum of approaches that interpolates between habitual and goal-directed responding. We found that increasing time pressure led to shallower goal-directed planning, suggesting that a speed-accuracy tradeoff controls the depth of planning, with deeper search leading to more accurate evaluation at the cost of slower decision-making. We conclude that subjects integrate habit-based cached values directly into goal-directed evaluations in a normative manner.
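In symbols (using the notation introduced in Fig. 1 below; this is a paraphrase of the idea rather than the paper's exact formulation), a plan-until-habit evaluation with planning depth k truncates the goal-directed sum of discounted rewards after k simulated steps and bootstraps the remainder with a cached habitual value:

    Q_k(s_0, a_0) = E[ r_0 + γ·r_1 + … + γ^(k-1)·r_(k-1) + γ^k · Q_habit(s_k, a_k) ]

Setting k = 0 recovers purely habitual control, while letting k reach the task horizon recovers pure goal-directed planning.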

Keywords: habit; planning; reinforcement learning; speed/accuracy tradeoff; tree-based evaluation.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Schematic of the algorithm in an example decision problem (see SI Appendix for the general formal algorithm). Assume an individual has a “mental model” of the reward and transition consequent on taking each action at each state in the environment. The value of taking action a at the current state s is denoted by Q(s,a) and is defined as the sum of rewards (temporally discounted by a factor of γ per step, where 0 ≤ γ ≤ 1) that are expected to be received upon performing that action. Q(s,a) can be estimated in different ways. (A) “Planning” involves simulating the tree of future states and actions to arbitrary depths (k) and summing up all of the expected discounted consequences, given a behavioral policy. (B) An intermediate form of control (i.e., plan-until-habit) involves limited-depth forward simulations (k = 3 in our example) to foresee the expected consequences of actions up to that depth (i.e., up to state s'). The sum of those foreseen consequences (r_0 + γ·r_1 + γ²·r_2) is then added to the cached habitual assessment [γ^k·Q_habit(s',a')] of the consequences of the remaining choices starting from the deepest explicitly foreseen states (s'). (C) At the other end of the depth-of-planning spectrum, “habitual control” avoids planning (k = 0) by relying instead on estimates Q_habit(s,a) that are cached from previous experience. These cached values are updated based on rewards obtained when making a choice.
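A minimal Python sketch of this recursion, assuming the mental model is given as dictionaries of expected rewards and transition probabilities, the cached habitual values as a Q-table, and a greedy policy over simulated successors (these representational choices are illustrative, not the paper's implementation):

    def q_value(state, action, k, rewards, transitions, actions, q_habit, gamma=0.9):
        """Plan-until-habit evaluation of (state, action) with planning depth k.

        rewards[(s, a)]     -- expected immediate reward for taking a in s
        transitions[(s, a)] -- list of (next_state, probability) pairs
        actions[s]          -- actions available in s (empty list if terminal)
        q_habit[(s, a)]     -- cached habitual value of (s, a)
        """
        if k == 0:
            # Depth 0: pure habitual control, fall back on the cached value.
            return q_habit[(state, action)]
        # One step of explicit simulation: expected immediate reward plus the
        # discounted, probability-weighted value of each successor, itself
        # evaluated one level shallower (so cached habits take over at depth k).
        value = rewards[(state, action)]
        for next_state, prob in transitions[(state, action)]:
            if actions[next_state]:  # non-terminal successor
                value += gamma * prob * max(
                    q_value(next_state, a, k - 1, rewards, transitions,
                            actions, q_habit, gamma)
                    for a in actions[next_state])
        return value

With k = 0 this reproduces pure habitual control (panel C), with k large enough to reach the terminal states it reproduces pure planning (panel A), and intermediate k gives the plan-until-habit strategy of panel B.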
Fig. 2.
Schematic and implementation of the experimental design. (A) Each trial started from state s1, which afforded two actions (illustrated by red and green arrows). Depending on the chosen action, a common (P = 0.7) or rare (P = 0.3) transition was made to one of two second-stage states. There, the subject again had two choices, each associated with common (P = 0.7) or rare (P = 0.3) transitions to two of the four third-stage states. After performing a forced-choice action at the resulting third-stage state, the subject observed whether or not that state contained a reward point. In each trial, only one of the four terminal states contained reward. The reward stayed in one terminal state for a random number of trials and then moved randomly to one of the other three terminal states. (B) Two groups of subjects performed the task for around 400 trials: a high-resource group (n = 15) had 2 s, and a low-resource group (n = 15) had 700 ms, to react at each of the three stages. See SI Appendix and Figs. S1–S3 for further details.
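A rough Python sketch of this trial structure (the pairing of second-stage states with particular terminal states, and the geometric dwell time of the reward, are illustrative assumptions; the caption specifies only the 0.7/0.3 transition probabilities and that the reward jumps to another terminal state after a random number of trials):

    import random

    def run_trial(choice1, choice2, rewarded_terminal):
        """One trial of the three-step task; choices are 0 or 1."""
        # Stage 1 -> stage 2: common (P = 0.7) or rare (P = 0.3) transition.
        common1 = random.random() < 0.7
        stage2 = choice1 if common1 else 1 - choice1
        # Stage 2 -> stage 3: again common (P = 0.7) or rare (P = 0.3);
        # second-stage state i is assumed to lead to terminal states 2i and 2i+1.
        common2 = random.random() < 0.7
        stage3 = 2 * stage2 + (choice2 if common2 else 1 - choice2)
        # The third stage is a forced choice; the outcome only reveals whether
        # the reached terminal state currently holds the reward point.
        reward = int(stage3 == rewarded_terminal)
        return {'choice1': choice1, 'common1': common1,
                'common2': common2, 'stage3': stage3, 'reward': reward}

    def maybe_move_reward(rewarded_terminal, p_move=0.1):
        """Reward stays put for a random number of trials, then jumps to one of
        the other three terminal states (geometric dwell time is an assumption)."""
        if random.random() < p_move:
            return random.choice([s for s in range(4) if s != rewarded_terminal])
        return rewarded_terminal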
Fig. 3.
Results of simulating artificial agents with different depths of planning in the task described in Fig. 2A. (A) Probabilities, predicted by three different strategies, of repeating the first-stage choice (“stay probability”) after experiencing common (C) or rare (R) transitions at the first and second stages (concatenated letters) and after receiving reward (top row) or not (bottom row). The three strategies (columns, from left to right) are pure planning (k = 2), plan-until-habit (k = 1; planning only one step ahead and using habitual values at the second stage), and pure habitual control (k = 0; implemented by model-free temporal-difference learning). Each plot was averaged over 15 agents, each performing 500 trials. (B) Mixtures (action selection based on weighted average values) of the first and second strategies, with three different weights. See SI Appendix for details of the simulations and the rationale for the parameters used.
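The stay-probability measure plotted here reduces to a simple conditional frequency; a sketch follows (field names follow the hypothetical trial records in the sketch above):

    from collections import defaultdict

    def stay_probabilities(trials):
        """P(repeat first-stage choice), split by the previous trial's
        transition pattern (CC, CR, RC, RR) and whether it was rewarded."""
        stays = defaultdict(list)
        for prev, cur in zip(trials[:-1], trials[1:]):
            pattern = (('C' if prev['common1'] else 'R')
                       + ('C' if prev['common2'] else 'R'))
            stays[(pattern, prev['reward'])].append(
                int(cur['choice1'] == prev['choice1']))
        return {key: sum(v) / len(v) for key, v in stays.items()}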
Fig. 4.
Behavioral results. Both the high-resource (A) and low-resource (B) groups show significant effects of using pure planning (middle column), but only the low-resource group shows a significant effect of using the plan-until-habit strategy (right column) after both rewarded and unrewarded trials. Each black circle represents the average stay probability for one subject after the indicated type of trial. (C) Model-fitting results show that the first-stage weight on the plan-until-habit strategy, W_plan-until-habit^stage1, is significantly smaller in the high-resource group than in the low-resource group (P < 0.01). The two curves show the probability distribution of W_plan-until-habit^stage1 in the two groups; circles show the median of this distribution for each subject. (D) Within both groups, there is a strong correlation across subjects between W_plan-until-habit^stage1 and the second-stage weight on the pure habit strategy (as opposed to the planning strategy), W_habit^stage2. Each circle represents the medians of W_plan-until-habit^stage1 and W_habit^stage2 for a single subject. The Wilcoxon signed-rank test (nonparametric) was used in A and B; Spearman’s rank correlation (nonparametric) was used in D.
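Both nonparametric tests named in the caption are available in SciPy; a minimal sketch with placeholder per-subject values (the real inputs would be the per-subject effects and fitted weights described above):

    import numpy as np
    from scipy import stats

    # Placeholder per-subject summaries (n = 15 subjects per group).
    planning_effect = np.random.rand(15) - 0.3   # hypothetical within-group effect
    w_puh_stage1 = np.random.rand(15)            # median W_plan-until-habit^stage1 per subject
    w_habit_stage2 = 0.8 * w_puh_stage1 + 0.1 * np.random.rand(15)

    # A, B: Wilcoxon signed-rank test of whether the effect differs from zero.
    w_stat, p_val = stats.wilcoxon(planning_effect)

    # D: Spearman rank correlation between the two weights across subjects.
    rho, p_corr = stats.spearmanr(w_puh_stage1, w_habit_stage2)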

