Proc Natl Acad Sci U S A. 2010 Nov 23;107(47):20512-7. doi: 10.1073/pnas.1013470107. Epub 2010 Oct 25.

Optimal habits can develop spontaneously through sensitivity to local cost

Theresa M Desrochers et al. Proc Natl Acad Sci U S A. 2010.

Abstract

Habits and rituals are expressed universally across animal species. These behaviors are advantageous in allowing sequential behaviors to be performed without cognitive overload, and appear to rely on neural circuits that are relatively benign but vulnerable to takeover by extreme contexts, neuropsychiatric sequelae, and processes leading to addiction. Reinforcement learning (RL) is thought to underlie the formation of optimal habits. However, this theoretic formulation has principally been tested experimentally in simple stimulus-response tasks with relatively few available responses. We asked whether RL could also account for the emergence of habitual action sequences in realistically complex situations in which no repetitive stimulus-response links were present and in which many response options were present. We exposed naïve macaque monkeys to such experimental conditions by introducing a unique free saccade scan task. Despite the highly uncertain conditions and no instruction, the monkeys developed a succession of stereotypical, self-chosen saccade sequence patterns. Remarkably, these continued to morph for months, long after session-averaged reward and cost (eye movement distance) reached asymptote. Prima facie, these continued behavioral changes appeared to challenge RL. However, trial-by-trial analysis showed that pattern changes on adjacent trials were predicted by lowered cost, and RL simulations that reduced the cost reproduced the monkeys' behavior. Ultimately, the patterns settled into stereotypical saccade sequences that minimized the cost of obtaining the reward on average. These findings suggest that brain mechanisms underlying the emergence of habits, and perhaps unwanted repetitive behaviors in clinical disorders, could follow RL algorithms capturing extremely local explore/exploit tradeoffs.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Schematic of the free-viewing scan task. There was no requirement for the monkey’s eye position when the gray grid was displayed. After a variable Start Delay, the green target grid was presented indicating the start of the Scan Time. When the green target grid was displayed, and once the monkey’s gaze entered the area defined by the green grid, the only requirement was that the eye position remained in that space. After the variable Delay Scan, the Reward Scan began when a randomly chosen target was baited without any indication to the monkey. There was no time limit on the duration of the Reward Scan. Once the monkey captured the baited target by fixating or saccading through it, the green grid immediately turned off and the trial proceeded through the remaining task periods as illustrated. If the monkey’s eye position exited the green grid area before capturing the baited target, the trial was immediately aborted by extinguishing the green target grid, and no reward was delivered.
Fig. 2.
Loop sequences emerge and shift during prolonged task performance. Each plot shows the fraction of rewarded trials per session containing the most frequent saccade paths that form a closed loop (start and stop on the same target), regardless of start or stop position during Total Scan Time. (A) Monkey G, four-target task (G4). (B) Monkey Y, four-target task (Y4). (C) Monkey G, nine-target task (G9). (D) Monkey Y, nine-target task (Y9). Dashed line in the first panel indicates a slight variation from the main pattern that is included in the fraction.
Fig. 3.
Nonnegative matrix factorization (NMF) shows successive appearance of factors resembling loop sequences. Each panel displays the weight of each factor during Total Scan Time on all rewarded trials through task performance. Factors are diagrammed in colors to show similarity to the loop sequences in Fig. 2. Numbers in the upper corner of the factors indicate their rank order by total magnitude (i.e., sum of the weights across sessions). (A) G4, rms error of factorization = 0.02673. (B) Y4, rms error = 0.02452. (C) G9, rms error = 0.0225. (D) Y9, rms error = 0.01728.
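The factorization itself can be reproduced with standard tools. Below is a minimal sketch (not the paper's code) using scikit-learn's NMF, assuming each rewarded trial is encoded as a nonnegative vector of target-to-target transition counts; the matrix sizes, the Poisson-generated placeholder data, and the choice of four factors are illustrative assumptions.

    # Minimal NMF sketch on hypothetical saccade-transition data.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)

    # Placeholder data: one row per rewarded trial, one column per possible
    # target-to-target transition (4 targets -> 16 ordered transitions).
    n_trials, n_transitions = 500, 16
    X = rng.poisson(1.0, size=(n_trials, n_transitions)).astype(float)

    # Factorize X ~ W @ H: H holds "basis" transition patterns (cf. the loop-like
    # factors in Fig. 3), W holds the weight of each factor on each trial.
    model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
    W = model.fit_transform(X)   # (n_trials, 4) factor weights
    H = model.components_        # (4, n_transitions) transition patterns

    # RMS reconstruction error, analogous to the values quoted in the caption.
    rms = np.sqrt(np.mean((X - W @ H) ** 2))
    print(rms)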
Fig. 4.
Session-averaged behavioral measures. All rows show monkeys and conditions in the order depicted in A (G4, Y4, G9, and Y9). (A) Reward rate measured as number of rewards per total Reward Scan time in each session. (B) Mean saccade distance during Reward Scan per session with shading indicating approximate confidence limits (±1.96 × SEM). Gray vertical bars in A and B indicate sessions containing shaping periods when the task was made easier for the monkey (see Materials and Methods and Fig. S1). (C) Entropy of transition probabilities during Total Scan Time.
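The entropy measure in C can be sketched as follows, under the assumption that a session is represented as the sequence of fixated target indices and that the Shannon entropy of the next-target distribution is averaged over starting targets; the paper's exact definition may differ.

    # Sketch: entropy of saccade transition probabilities (assumed encoding).
    import numpy as np

    def transition_entropy(saccade_targets, n_targets):
        """Mean Shannon entropy (bits) of the next-target distribution,
        estimated from one session's sequence of fixated targets."""
        counts = np.zeros((n_targets, n_targets))
        for a, b in zip(saccade_targets[:-1], saccade_targets[1:]):
            counts[a, b] += 1
        row_sums = counts.sum(axis=1, keepdims=True)
        probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
        with np.errstate(divide="ignore"):
            logp = np.where(probs > 0, np.log2(probs), 0.0)
        return -(probs * logp).sum(axis=1).mean()

    # Example: a fully repetitive loop (0->1->2->3->0...) gives zero entropy.
    print(transition_entropy([0, 1, 2, 3] * 25, n_targets=4))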
Fig. 5.
Trial-by-trial reinforcement test shows correlation between cost and change in pattern. Distances were simplified to the geometric distance: one unit is the horizontal or vertical distance between adjacent targets. The change in total scan distance and the pattern dissimilarity (one minus pattern similarity; SI Methods) were computed for each trial. Trials were then binned into 10 equal bins. The median of each of the 10 bins was plotted and used to compute the correlation (red line) between the change in distance and the pattern dissimilarity. The total number of trials (n), correlation coefficients (R), and correlation p values are listed below. Note that this p value is different from the P value reported in the text to indicate significance resulting from the shuffle test. (A) G4: n = 6,109 trials; R = 0.613, p = 0.060; slope = 0.006. (B) Y4: n = 25,113; R = 0.737, p = 0.015; slope = 0.002. (C) G9: n = 5,912; R = 0.672, p = 0.033; slope = 0.001. (D) Y9: n = 54,214; R = 0.951, p < 0.0001; slope = 0.005.
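A rough sketch of this binned correlation test is given below. It assumes trial pairs are sorted and split into 10 equal-sized bins by the change in total scan distance (the caption does not state the binning variable, so this is one plausible reading), and it uses synthetic placeholder data rather than the monkeys' trials.

    # Sketch of the binned trial-by-trial test (hypothetical per-trial inputs).
    import numpy as np
    from scipy import stats

    def binned_cost_vs_change(delta_distance, dissimilarity, n_bins=10):
        """Bin trial pairs by change in scan distance, take the median of each
        bin in both variables, and correlate the bin medians (cf. Fig. 5)."""
        order = np.argsort(delta_distance)
        bins = np.array_split(order, n_bins)          # ~equal-sized bins
        med_dx = [np.median(delta_distance[b]) for b in bins]
        med_ds = [np.median(dissimilarity[b]) for b in bins]
        r, p = stats.pearsonr(med_dx, med_ds)
        return r, p

    # Toy usage with synthetic data in which cost changes track pattern changes.
    rng = np.random.default_rng(1)
    dx = rng.normal(0, 1, 6000)                        # change in total scan distance
    ds = 0.5 + 0.005 * dx + rng.normal(0, 0.05, 6000)  # pattern dissimilarity
    print(binned_cost_vs_change(dx, ds))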
Fig. 6.
The REINFORCE algorithm simulation performs similarly to the monkeys. (A–E) For each row, columns represent conditions depicted in A: REINFORCE simulation of the four-target task (Sim4) and REINFORCE simulation of the nine-target task (Sim9). Reward rate measured as number of rewards per total simulated trial time in each session (A); mean geometric distance per session (B); entropy of transition probabilities per session (C); final transition probabilities (D); and resulting most probable pattern (E). (F and G) NMF of simulations as in Fig. 3 for Sim4 (F, rms = 0.02314) and Sim9 (G, rms = 0.03655).
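For orientation, the following is a minimal REINFORCE-style sketch of the kind of simulation described here: a softmax policy over next-target saccades is updated by the episode return, defined as the capture reward minus a movement-cost term. The 2 x 2 target grid, cost weight, learning rate, and episode cap are assumptions made for the sketch and are not the parameters used in the paper.

    # Minimal, hypothetical REINFORCE-style sketch of the scan-task simulation.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 4                                        # number of targets (2 x 2 grid; assumption)
    coords = np.array([(i // 2, i % 2) for i in range(n)], dtype=float)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)

    logits = np.zeros((n, n))                    # policy parameters: current -> next target
    np.fill_diagonal(logits, -np.inf)            # disallow "saccading" to the same target
    alpha, cost_w = 0.05, 0.2                    # learning rate and cost weight (assumed)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    for episode in range(20000):
        baited = int(rng.integers(n))            # baited target, unknown to the agent
        pos = int(rng.integers(n))               # random starting target
        traj, total_cost = [], 0.0
        for _ in range(100):                     # safety cap on scan length
            if pos == baited:
                break
            p = softmax(logits[pos])
            nxt = int(rng.choice(n, p=p))
            traj.append((pos, nxt))
            total_cost += dist[pos, nxt]
            pos = nxt
        ret = (1.0 if pos == baited else 0.0) - cost_w * total_cost
        for s, a in traj:                        # REINFORCE: theta += alpha * return * grad log pi
            grad = -softmax(logits[s])
            grad[a] += 1.0
            logits[s] += alpha * ret * grad

    # Learned transition probabilities (cf. panel D): short, low-cost loops dominate.
    print(np.round(np.apply_along_axis(softmax, 1, logits), 2))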

Comment in

  • Learning optimal strategies in complex environments.
    Sejnowski TJ. Proc Natl Acad Sci U S A. 2010 Nov 23;107(47):20151-2. doi: 10.1073/pnas.1014954107. Epub 2010 Nov 15. PMID: 21078996. Free PMC article. No abstract available.


