Interplay of approximate planning strategies

Quentin J M Huys et al.

Proc Natl Acad Sci U S A. 2015 Mar 10;112(10):3098-3103. doi: 10.1073/pnas.1414219112. Epub 2015 Feb 9.

Abstract

Humans routinely formulate plans in domains so complex that even the most powerful computers are taxed. To do so, they seem to avail themselves of many strategies and heuristics that efficiently simplify, approximate, and hierarchically decompose hard tasks into simpler subtasks. Theoretical and cognitive research has revealed several such strategies; however, little is known about their establishment, interaction, and efficiency. Here, we use model-based behavioral analysis to provide a detailed examination of the performance of human subjects in a moderately deep planning task. We find that subjects exploit the structure of the domain to establish subgoals in a way that achieves a nearly maximal reduction in the cost of computing values of choices, but then combine partial searches with greedy local steps to solve subtasks, and maladaptively prune the decision trees of subtasks in a reflexive manner upon encountering salient losses. Subjects come idiosyncratically to favor particular sequences of actions to achieve subgoals, creating novel complex actions or "options."

Keywords: hierarchical reinforcement learning; memoization; planning; pruning.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Task. (A) Task display. On each trial, subjects saw six boxes. The bright box indicated the randomly chosen starting location. The number of moves to plan was displayed at the top. During the decision time of 9 s, subjects had to plan between three and five moves. Then, during the input time of 2.5 s, they had to enter their plan as a single sequence of right/left button presses in one go, without immediate feedback as to what state they were currently in or what rewards they had earned in the choice sequence so far. After the entire sequence had been entered, the chosen sequence and the rewards earned were displayed in the order in which they had been entered. Failure to enter a button-press sequence of the right length in the given time resulted in a penalty of −200 pence. (B) Task structure. Subjects were placed in one of the six boxes (“states”) at the beginning of each trial and had to plan a path through the maze that maximized the total outcome earned. From each state, two successor states could be reached deterministically by pressing either the right (dashed lines) or left (solid lines) key. For example, from state 1, state 4 could be reached by pressing left–left–right. Each transition resulted in a deterministic reward or loss; red arrows, for instance, denote large salient losses of −70 points. The possible transitions were never displayed on screen. (C) Pruning. The decision tree faced by subjects for a depth-3 problem starting in state 3. When one of the large losses (−70, red arrows in B) is encountered, the search along that subtree is terminated. The blue parts of the tree are thereby not evaluated, reducing the cost of computation. In this case, pruning leads to a suboptimal sequence appearing to be optimal. (D) Hierarchical fragmentation of the same problem. Rather than evaluating the entire depth-3 tree, a 2–1 fragmentation would first search the tree up to depth 2 (large green area), choose a depth-2 sequence (black arrow), and then search the remaining depth-1 tree (bottom-right green area). The blue area of the tree is again not evaluated. Optimal choices in the fragmented tree may miss the overall optimal sequence, which in this case would be on the far left of the tree. If a subject emitted the sequence on the far right, this sequence would be more likely under a 2–1 fragmentation than under a nonfragmented tree of full depth 3. The effective “subgoal” corresponding to the target of the first fragment (the end state of the subsequence resulting from the first part of the fragmentation) is indicated by a red asterisk.
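To make the two cost-saving devices in this caption concrete, the sketch below implements a depth-limited evaluation of button-press sequences with optional pruning at large losses (as in C) and a hierarchical 2–1 style fragmentation (as in D). This is an illustrative toy rather than the authors' fitted model: the maze wiring and reward values are placeholders (the caption does not give the full transition diagram of Fig. 1B), and pruning is applied deterministically at every −70 transition rather than probabilistically.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# PLACEHOLDER maze: the real transition diagram and rewards of Fig. 1B are not
# reproduced here; only the general structure (6 states, deterministic left/right
# moves, a few -70 "large loss" transitions) is kept.
TRANSITIONS: Dict[int, Dict[str, int]] = {
    1: {"left": 2, "right": 3}, 2: {"left": 3, "right": 4},
    3: {"left": 4, "right": 5}, 4: {"left": 5, "right": 6},
    5: {"left": 6, "right": 1}, 6: {"left": 1, "right": 2},
}
REWARDS: Dict[Tuple[int, int], int] = {
    (s, t): 20 for s, moves in TRANSITIONS.items() for t in moves.values()
}
REWARDS[(5, 1)] = REWARDS[(6, 1)] = -70  # placeholder "red arrow" transitions
LARGE_LOSS = -70


def sequence_value(start: int, actions: Sequence[str], prune: bool) -> int:
    """Sum the deterministic outcomes along an action sequence. With prune=True,
    evaluation stops at the first large loss (Fig. 1C): everything deeper in that
    branch is simply never added up."""
    total, state = 0, start
    for a in actions:
        nxt = TRANSITIONS[state][a]
        total += REWARDS[(state, nxt)]
        if prune and REWARDS[(state, nxt)] == LARGE_LOSS:
            break
        state = nxt
    return total


def best_sequence(start: int, depth: int, prune: bool) -> List[str]:
    """Exhaustively score all 2**depth sequences under the (possibly pruned)
    evaluation and return the best-looking one."""
    return list(max(product(("left", "right"), repeat=depth),
                    key=lambda seq: sequence_value(start, seq, prune)))


def fragmented_sequence(start: int, fragments: Sequence[int], prune: bool) -> List[str]:
    """Hierarchical fragmentation (Fig. 1D): e.g. fragments=(2, 1) searches the
    depth-2 tree, commits to that subsequence, then searches the remaining
    depth-1 tree from the resulting subgoal state."""
    plan, state = [], start
    for depth in fragments:
        sub = best_sequence(state, depth, prune)
        plan += sub
        for a in sub:
            state = TRANSITIONS[state][a]
    return plan


# A depth-3 trial from state 3, solved in full vs. as a 2-1 fragmentation:
# print(best_sequence(3, 3, prune=True), fragmented_sequence(3, (2, 1), prune=True))
```

Under this toy setup, the fragmented planner evaluates 2² + 2¹ = 6 sequences instead of 2³ = 8, which is the kind of reduction in computational cost the fragmentation in D is meant to buy.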
Fig. 2.
Fragmentation. (A) Fragment endpoint distributions when including all fragments within an individual choice sequence. Each panel shows the distribution of end states for fragments starting in each of the six states (fragment start states). The left column shows the endpoint distribution when considering all fragments. Fragments starting in state 5 terminated in state 2 or state 6 with high probability. The middle column shows the endpoints of fragments of length one, and the rightmost column the endpoints of fragments of length greater than one. (B) Endpoint distribution for the fragmentation that achieves the optimal choice at the least computational cost.
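As a small aid to reading these panels, the sketch below shows one straightforward way such conditional endpoint distributions could be tabulated. The representation of an inferred fragment as the sequence of states it visits is an assumption made for illustration, not the paper's data format.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Assumed representation: a fragment is the sequence of states it visits,
# e.g. (5, 6, 2) is a length-2 fragment that starts in state 5 and ends in state 2.
Fragment = Tuple[int, ...]


def endpoint_distributions(fragments: Iterable[Fragment],
                           min_len: int = 1,
                           max_len: Optional[int] = None) -> Dict[int, Dict[int, float]]:
    """For each fragment start state, return the relative frequencies of end states,
    restricted to fragments whose length (number of moves) lies in [min_len, max_len]."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for frag in fragments:
        length = len(frag) - 1
        if length < min_len or (max_len is not None and length > max_len):
            continue
        counts[frag[0]][frag[-1]] += 1
    return {start: {end: n / sum(c.values()) for end, n in c.items()}
            for start, c in counts.items()}


# Columns of Fig. 2A: all fragments, fragments of length one, fragments longer than one.
# all_fragments = endpoint_distributions(frags)
# length_one    = endpoint_distributions(frags, max_len=1)
# longer        = endpoint_distributions(frags, min_len=2)
```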
Fig. 3.
Fragment characteristics. (A) Distribution over inferred fragment lengths. (B) Overall distribution over fragment endpoints. State 2 is the most frequent endpoint. Blue lines in A and B show the distributions for the optimal fragmentations. (C) Nested model comparison. Each bar shows the group-level iBIC score for one model as additional cognitive processes are added. (D) Over time, only the most frequently used fragment increases in frequency, whereas all others are used progressively less often. (E) The entropy of the distribution over fragments used falls nearly linearly over time. (F) Discount factors (within fragments). An outcome lying x transitions ahead is multiplied by 1 − γ a total of x − 1 times. For outcomes lying distal to large losses (“specific pruning”), 1 − γ_S is substantially smaller than 1, implying robust discounting. In contrast, for outcomes distal to non-large-loss outcomes (“general pruning”), 1 − γ_G is indistinguishable from 1 for every subject, meaning that these are not down-weighted within fragments. Thus, subjects search to the end of the fragment but show a strong tendency to stop the search at large losses even within fragments (1 − γ_S < 1).
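One way to read the discounting described in F is sketched below: the weight on an outcome x transitions ahead within a fragment is a product of x − 1 continuation terms, each equal to 1 − γ_S if the corresponding intermediate outcome is a large loss and 1 − γ_G otherwise. This is a hedged paraphrase of the caption, not the paper's fitted model; the outcome values and parameter settings in the example are invented for illustration.

```python
from typing import Sequence

LARGE_LOSS = -70  # the salient losses of Fig. 1B


def outcome_weight(path_rewards: Sequence[int],
                   gamma_specific: float,
                   gamma_general: float) -> float:
    """Weight applied to the final outcome of `path_rewards`, whose first x - 1
    entries are the outcomes passed on the way to it within the fragment."""
    w = 1.0
    for r in path_rewards[:-1]:
        w *= (1 - gamma_specific) if r == LARGE_LOSS else (1 - gamma_general)
    return w


# With gamma_general near 0 (1 - gamma_G indistinguishable from 1), outcomes on
# large-loss-free paths are not down-weighted, while gamma_specific > 0 strongly
# discounts anything lying beyond a -70 transition (1 - gamma_S < 1):
print(outcome_weight([20, 20, 140], gamma_specific=0.6, gamma_general=0.0))   # -> 1.0
print(outcome_weight([20, -70, 140], gamma_specific=0.6, gamma_general=0.0))  # -> 0.4
```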

Comment in

  • How to divide and conquer the world, one step at a time.
    Daniel R, Schuck NW, Niv Y. Proc Natl Acad Sci U S A. 2015 Mar 10;112(10):2929-2930. doi: 10.1073/pnas.1500975112. Epub 2015 Mar 2. PMID: 25733879. Free PMC article. No abstract available.

