2016 Nov 8;113(45):12868-12873.
doi: 10.1073/pnas.1609094113. Epub 2016 Oct 24.

Adaptive integration of habits into depth-limited planning defines a habitual-goal-directed spectrum


Mehdi Keramati et al. Proc Natl Acad Sci U S A.

Abstract

Behavioral and neural evidence reveal a prospective goal-directed decision process that relies on mental simulation of the environment, and a retrospective habitual process that caches returns previously garnered from available choices. Artificial systems combine the two by simulating the environment up to some depth and then exploiting habitual values as proxies for consequences that may arise in the further future. Using a three-step task, we provide evidence that human subjects use such a normative plan-until-habit strategy, implying a spectrum of approaches that interpolates between habitual and goal-directed responding. We found that increasing time pressure led to shallower goal-directed planning, suggesting that a speed-accuracy tradeoff controls the depth of planning, with deeper search leading to more accurate evaluation at the cost of slower decision-making. We conclude that subjects integrate habit-based cached values directly into goal-directed evaluations in a normative manner.
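In symbols (using the notation introduced in Fig. 1 below; this is a paraphrase of the idea rather than the paper's exact formulation), a plan-until-habit evaluation with planning depth k truncates the goal-directed sum of discounted rewards after k simulated steps and bootstraps the remainder with a cached habitual value:

    Q_k(s_0, a_0) = E[ r_0 + γ·r_1 + … + γ^(k-1)·r_(k-1) + γ^k · Q_habit(s_k, a_k) ]

Setting k = 0 recovers purely habitual control, while letting k reach the task horizon recovers pure goal-directed planning.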

Keywords: habit; planning; reinforcement learning; speed/accuracy tradeoff; tree-based evaluation.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Schematic of the algorithm in an example decision problem (see SI Appendix for the general formal algorithm). Assume an individual has a “mental model” of the reward and transition consequent on taking each action at each state in the environment. The value of taking action a at the current state s is denoted by Q(s,a) and is defined as the sum of rewards (temporally discounted by a factor of γ per step, where 0 ≤ γ ≤ 1) that are expected to be received upon performing that action. Q(s,a) can be estimated in different ways. (A) “Planning” involves simulating the tree of future states and actions to arbitrary depths (k) and summing up all of the expected discounted consequences, given a behavioral policy. (B) An intermediate form of control (i.e., plan-until-habit) involves limited-depth forward simulations (k = 3 in our example) to foresee the expected consequences of actions up to that depth (i.e., up to state s'). The sum of those foreseen consequences (r_0 + γ·r_1 + γ²·r_2) is then added to the cached habitual assessment [γ^k·Q_habit(s',a')] of the consequences of the remaining choices starting from the deepest explicitly foreseen states (s'). (C) At the other end of the depth-of-planning spectrum, “habitual control” avoids planning (k = 0) by relying instead on estimates Q_habit(s,a) that are cached from previous experience. These cached values are updated based on rewards obtained when making a choice.
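A minimal Python sketch of this recursion, assuming the mental model is given as dictionaries of expected rewards and transition probabilities, the cached habitual values as a Q-table, and a greedy policy over simulated successors (these representational choices are illustrative, not the paper's implementation):

    def q_value(state, action, k, rewards, transitions, actions, q_habit, gamma=0.9):
        """Plan-until-habit evaluation of (state, action) with planning depth k.

        rewards[(s, a)]     -- expected immediate reward for taking a in s
        transitions[(s, a)] -- list of (next_state, probability) pairs
        actions[s]          -- actions available in s (empty list if terminal)
        q_habit[(s, a)]     -- cached habitual value of (s, a)
        """
        if k == 0:
            # Depth 0: pure habitual control, fall back on the cached value.
            return q_habit[(state, action)]
        # One step of explicit simulation: expected immediate reward plus the
        # discounted, probability-weighted value of each successor, itself
        # evaluated one level shallower (so cached habits take over at depth k).
        value = rewards[(state, action)]
        for next_state, prob in transitions[(state, action)]:
            if actions[next_state]:  # non-terminal successor
                value += gamma * prob * max(
                    q_value(next_state, a, k - 1, rewards, transitions,
                            actions, q_habit, gamma)
                    for a in actions[next_state])
        return value

With k = 0 this reproduces pure habitual control (panel C), with k large enough to reach the terminal states it reproduces pure planning (panel A), and intermediate k gives the plan-until-habit strategy of panel B.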
Fig. 2.
Schematic and implementation of the experimental design. (A) Each trial started from state s1, which afforded two actions (illustrated by red and green arrows). Depending on the chosen action, a common (P = 0.7) or rare (P = 0.3) transition was made to one of two second-stage states. There, the subject again had two choices, each associated with common (P = 0.7) or rare (P = 0.3) transitions to two of the four third-stage states. After performing a forced-choice action at the resulting third-stage state, the subject observed whether or not that state contained a reward point. In each trial, only one of the four terminal states contained reward. The reward stayed in one terminal state for a random number of trials and then moved randomly to one of the other three terminal states. (B) Two groups of subjects performed the task for around 400 trials: a high-resource group (n = 15) had 2 s, and a low-resource group (n = 15) had 700 ms, to react at each of the three stages. See SI Appendix and Figs. S1–S3 for further details.
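A rough Python sketch of this trial structure (the pairing of second-stage states with particular terminal states, and the geometric dwell time of the reward, are illustrative assumptions; the caption specifies only the 0.7/0.3 transition probabilities and that the reward jumps to another terminal state after a random number of trials):

    import random

    def run_trial(choice1, choice2, rewarded_terminal):
        """One trial of the three-step task; choices are 0 or 1."""
        # Stage 1 -> stage 2: common (P = 0.7) or rare (P = 0.3) transition.
        common1 = random.random() < 0.7
        stage2 = choice1 if common1 else 1 - choice1
        # Stage 2 -> stage 3: again common (P = 0.7) or rare (P = 0.3);
        # second-stage state i is assumed to lead to terminal states 2i and 2i+1.
        common2 = random.random() < 0.7
        stage3 = 2 * stage2 + (choice2 if common2 else 1 - choice2)
        # The third stage is a forced choice; the outcome only reveals whether
        # the reached terminal state currently holds the reward point.
        reward = int(stage3 == rewarded_terminal)
        return {'choice1': choice1, 'common1': common1,
                'common2': common2, 'stage3': stage3, 'reward': reward}

    def maybe_move_reward(rewarded_terminal, p_move=0.1):
        """Reward stays put for a random number of trials, then jumps to one of
        the other three terminal states (geometric dwell time is an assumption)."""
        if random.random() < p_move:
            return random.choice([s for s in range(4) if s != rewarded_terminal])
        return rewarded_terminal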
Fig. 3.
Results of simulating artificial agents with different depths of planning in the task described in Fig. 2A. (A) Probabilities, predicted by three different strategies, of repeating the first-stage choice (“stay probability”) after experiencing common (C) or rare (R) transitions at the first and second stages (concatenated letters) and after receiving reward (top row) or not (bottom row). The three strategies (columns, from left to right) are pure planning (k = 2), plan-until-habit (k = 1; planning only one step ahead and using habitual values at the second stage), and pure habitual control (k = 0; implemented by model-free temporal-difference learning). Each plot was averaged over 15 agents, each performing 500 trials. (B) Mixtures (action selection based on weighted average values) of the first and second strategies, with three different weights. See SI Appendix for details of the simulations and the rationale for the parameters used.
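The stay-probability measure plotted here reduces to a simple conditional frequency; a sketch follows (field names follow the hypothetical trial records in the sketch above):

    from collections import defaultdict

    def stay_probabilities(trials):
        """P(repeat first-stage choice), split by the previous trial's
        transition pattern (CC, CR, RC, RR) and whether it was rewarded."""
        stays = defaultdict(list)
        for prev, cur in zip(trials[:-1], trials[1:]):
            pattern = (('C' if prev['common1'] else 'R')
                       + ('C' if prev['common2'] else 'R'))
            stays[(pattern, prev['reward'])].append(
                int(cur['choice1'] == prev['choice1']))
        return {key: sum(v) / len(v) for key, v in stays.items()}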
Fig. 4.
Behavioral results. Both the high-resource (A) and low-resource (B) groups show significant effects of using pure planning (middle column), but only the low-resource group shows a significant effect of using the plan-until-habit strategy (right column) after both rewarded and unrewarded trials. Each black circle represents the average stay probability for one subject after the indicated type of trial. (C) Model-fitting results show that the first-stage weight on the plan-until-habit strategy, W_plan-until-habit^stage1, is significantly smaller in the high-resource group than in the low-resource group (P < 0.01). The two curves show the probability distribution of W_plan-until-habit^stage1 in the two groups; circles show the median of this distribution for each subject. (D) Within both groups, there is a strong correlation across subjects between W_plan-until-habit^stage1 and the second-stage weight on the pure habit strategy (as opposed to the planning strategy), W_habit^stage2. Each circle represents the medians of W_plan-until-habit^stage1 and W_habit^stage2 for a single subject. The Wilcoxon signed-rank test (nonparametric) was used in A and B; Spearman’s rank correlation (nonparametric) was used in D.
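Both nonparametric tests named in the caption are available in SciPy; a minimal sketch with placeholder per-subject values (the real inputs would be the per-subject effects and fitted weights described above):

    import numpy as np
    from scipy import stats

    # Placeholder per-subject summaries (n = 15 subjects per group).
    planning_effect = np.random.rand(15) - 0.3   # hypothetical within-group effect
    w_puh_stage1 = np.random.rand(15)            # median W_plan-until-habit^stage1 per subject
    w_habit_stage2 = 0.8 * w_puh_stage1 + 0.1 * np.random.rand(15)

    # A, B: Wilcoxon signed-rank test of whether the effect differs from zero.
    w_stat, p_val = stats.wilcoxon(planning_effect)

    # D: Spearman rank correlation between the two weights across subjects.
    rho, p_corr = stats.spearmanr(w_puh_stage1, w_habit_stage2)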

