Proc Natl Acad Sci U S A. 2010 Nov 23;107(47):20512-7. doi: 10.1073/pnas.1013470107. Epub 2010 Oct 25.

Optimal habits can develop spontaneously through sensitivity to local cost

Theresa M Desrochers et al. Proc Natl Acad Sci U S A. 2010.

Abstract

Habits and rituals are expressed universally across animal species. These behaviors are advantageous in allowing sequential behaviors to be performed without cognitive overload, and appear to rely on neural circuits that are relatively benign but vulnerable to takeover by extreme contexts, neuropsychiatric sequelae, and processes leading to addiction. Reinforcement learning (RL) is thought to underlie the formation of optimal habits. However, this theoretic formulation has principally been tested experimentally in simple stimulus-response tasks with relatively few available responses. We asked whether RL could also account for the emergence of habitual action sequences in realistically complex situations in which no repetitive stimulus-response links were present and in which many response options were present. We exposed naïve macaque monkeys to such experimental conditions by introducing a unique free saccade scan task. Despite the highly uncertain conditions and no instruction, the monkeys developed a succession of stereotypical, self-chosen saccade sequence patterns. Remarkably, these continued to morph for months, long after session-averaged reward and cost (eye movement distance) reached asymptote. Prima facie, these continued behavioral changes appeared to challenge RL. However, trial-by-trial analysis showed that pattern changes on adjacent trials were predicted by lowered cost, and RL simulations that reduced the cost reproduced the monkeys' behavior. Ultimately, the patterns settled into stereotypical saccade sequences that minimized the cost of obtaining the reward on average. These findings suggest that brain mechanisms underlying the emergence of habits, and perhaps unwanted repetitive behaviors in clinical disorders, could follow RL algorithms capturing extremely local explore/exploit tradeoffs.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Schematic of the free-viewing scan task. There was no requirement for the monkey’s eye position when the gray grid was displayed. After a variable Start Delay, the green target grid was presented indicating the start of the Scan Time. When the green target grid was displayed, and once the monkey’s gaze entered the area defined by the green grid, the only requirement was that the eye position remained in that space. After the variable Delay Scan, the Reward Scan began when a randomly chosen target was baited without any indication to the monkey. There was no time limit on the duration of the Reward Scan. Once the monkey captured the baited target by fixating or saccading through it, the green grid immediately turned off and the trial proceeded through the remaining task periods as illustrated. If the monkey’s eye position exited the green grid area before capturing the baited target, the trial was immediately aborted by extinguishing the green target grid, and no reward was delivered.
Fig. 2.
Loop sequences emerge and shift during prolonged task performance. Each plot shows the fraction of rewarded trials per session containing the most frequent saccade paths that form a closed loop (start and stop on the same target), regardless of start or stop position during Total Scan Time. (A) Monkey G, four-target task (G4). (B) Monkey Y, four-target task (Y4). (C) Monkey G, nine-target task (G9). (D) Monkey Y, nine-target task (Y9). Dashed line in the first panel indicates a slight variation from the main pattern that is included in the fraction.
Fig. 3.
Nonnegative matrix factorization (NMF) shows successive appearance of factors resembling loop sequences. Each panel displays the weight of each factor during Total Scan Time on all rewarded trials through task performance. Factors are diagrammed in colors to show similarity to the loop sequences in Fig. 2. Numbers in the upper corner of the factors indicate their rank order by total magnitude (i.e., sum of the weights across sessions). (A) G4, rms error of factorization = 0.02673. (B) Y4, rms error = 0.02452. (C) G9, rms error = 0.0225. (D) Y9, rms error = 0.01728.
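The factorization itself can be reproduced with standard tools. Below is a minimal sketch (not the paper's code) using scikit-learn's NMF, assuming each rewarded trial is encoded as a nonnegative vector of target-to-target transition counts; the matrix sizes, the Poisson-generated placeholder data, and the choice of four factors are illustrative assumptions.

    # Minimal NMF sketch on hypothetical saccade-transition data.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)

    # Placeholder data: one row per rewarded trial, one column per possible
    # target-to-target transition (4 targets -> 16 ordered transitions).
    n_trials, n_transitions = 500, 16
    X = rng.poisson(1.0, size=(n_trials, n_transitions)).astype(float)

    # Factorize X ~ W @ H: H holds "basis" transition patterns (cf. the loop-like
    # factors in Fig. 3), W holds the weight of each factor on each trial.
    model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
    W = model.fit_transform(X)   # (n_trials, 4) factor weights
    H = model.components_        # (4, n_transitions) transition patterns

    # RMS reconstruction error, analogous to the values quoted in the caption.
    rms = np.sqrt(np.mean((X - W @ H) ** 2))
    print(rms)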
Fig. 4.
Session-averaged behavioral measures. All rows show monkeys and conditions in the order depicted in A (G4, Y4, G9, and Y9). (A) Reward rate measured as number of rewards per total Reward Scan time in each session. (B) Mean saccade distance during Reward Scan per session with shading indicating approximate confidence limits (±1.96 × SEM). Gray vertical bars in A and B indicate sessions containing shaping periods when the task was made easier for the monkey (see Materials and Methods and Fig. S1). (C) Entropy of transition probabilities during Total Scan Time.
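The entropy measure in C can be sketched as follows, under the assumption that a session is represented as the sequence of fixated target indices and that the Shannon entropy of the next-target distribution is averaged over starting targets; the paper's exact definition may differ.

    # Sketch: entropy of saccade transition probabilities (assumed encoding).
    import numpy as np

    def transition_entropy(saccade_targets, n_targets):
        """Mean Shannon entropy (bits) of the next-target distribution,
        estimated from one session's sequence of fixated targets."""
        counts = np.zeros((n_targets, n_targets))
        for a, b in zip(saccade_targets[:-1], saccade_targets[1:]):
            counts[a, b] += 1
        row_sums = counts.sum(axis=1, keepdims=True)
        probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
        with np.errstate(divide="ignore"):
            logp = np.where(probs > 0, np.log2(probs), 0.0)
        return -(probs * logp).sum(axis=1).mean()

    # Example: a fully repetitive loop (0->1->2->3->0...) gives zero entropy.
    print(transition_entropy([0, 1, 2, 3] * 25, n_targets=4))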
Fig. 5.
Trial-by-trial reinforcement test shows correlation between cost and change in pattern. Distances were simplified to the geometric distance: one unit is the horizontal or vertical distance between adjacent targets. The change in total scan distance and the pattern dissimilarity (one minus pattern similarity; SI Methods) were computed for each trial. Trials were then binned into 10 equal bins. The median of each of the 10 bins was plotted and used to compute the correlation (red line) between the change in distance and the pattern dissimilarity. The total number of trials (n), correlation coefficients (R), and correlation p values are listed below. Note that this p value is different from the P value reported in the text to indicate significance resulting from the shuffle test. (A) G4: n = 6,109 trials; R = 0.613, p = 0.060; slope = 0.006. (B) Y4: n = 25,113; R = 0.737, p = 0.015; slope = 0.002. (C) G9: n = 5,912; R = 0.672, p = 0.033; slope = 0.001. (D) Y9: n = 54,214; R = 0.951, p < 0.0001; slope = 0.005.
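A rough sketch of this binned correlation test is given below. It assumes trial pairs are sorted and split into 10 equal-sized bins by the change in total scan distance (the caption does not state the binning variable, so this is one plausible reading), and it uses synthetic placeholder data rather than the monkeys' trials.

    # Sketch of the binned trial-by-trial test (hypothetical per-trial inputs).
    import numpy as np
    from scipy import stats

    def binned_cost_vs_change(delta_distance, dissimilarity, n_bins=10):
        """Bin trial pairs by change in scan distance, take the median of each
        bin in both variables, and correlate the bin medians (cf. Fig. 5)."""
        order = np.argsort(delta_distance)
        bins = np.array_split(order, n_bins)          # ~equal-sized bins
        med_dx = [np.median(delta_distance[b]) for b in bins]
        med_ds = [np.median(dissimilarity[b]) for b in bins]
        r, p = stats.pearsonr(med_dx, med_ds)
        return r, p

    # Toy usage with synthetic data in which cost changes track pattern changes.
    rng = np.random.default_rng(1)
    dx = rng.normal(0, 1, 6000)                        # change in total scan distance
    ds = 0.5 + 0.005 * dx + rng.normal(0, 0.05, 6000)  # pattern dissimilarity
    print(binned_cost_vs_change(dx, ds))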
Fig. 6.
The REINFORCE algorithm simulation performs similarly to the monkeys. (A–E) For each row, columns represent conditions depicted in A: REINFORCE simulation of the four-target task (Sim4) and REINFORCE simulation of the nine-target task (Sim9). Reward rate measured as number of rewards per total simulated trial time in each session (A); mean geometric distance per session (B); entropy of transition probabilities per session (C); final transition probabilities (D); and resulting most probable pattern (E). (F and G) NMF of simulations as in Fig. 3 for Sim4 (F, rms = 0.02314) and Sim9 (G, rms = 0.03655).
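For orientation, the following is a minimal REINFORCE-style sketch of the kind of simulation described here: a softmax policy over next-target saccades is updated by the episode return, defined as the capture reward minus a movement-cost term. The 2 x 2 target grid, cost weight, learning rate, and episode cap are assumptions made for the sketch and are not the parameters used in the paper.

    # Minimal, hypothetical REINFORCE-style sketch of the scan-task simulation.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 4                                        # number of targets (2 x 2 grid; assumption)
    coords = np.array([(i // 2, i % 2) for i in range(n)], dtype=float)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)

    logits = np.zeros((n, n))                    # policy parameters: current -> next target
    np.fill_diagonal(logits, -np.inf)            # disallow "saccading" to the same target
    alpha, cost_w = 0.05, 0.2                    # learning rate and cost weight (assumed)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    for episode in range(20000):
        baited = int(rng.integers(n))            # baited target, unknown to the agent
        pos = int(rng.integers(n))               # random starting target
        traj, total_cost = [], 0.0
        for _ in range(100):                     # safety cap on scan length
            if pos == baited:
                break
            p = softmax(logits[pos])
            nxt = int(rng.choice(n, p=p))
            traj.append((pos, nxt))
            total_cost += dist[pos, nxt]
            pos = nxt
        ret = (1.0 if pos == baited else 0.0) - cost_w * total_cost
        for s, a in traj:                        # REINFORCE: theta += alpha * return * grad log pi
            grad = -softmax(logits[s])
            grad[a] += 1.0
            logits[s] += alpha * ret * grad

    # Learned transition probabilities (cf. panel D): short, low-cost loops dominate.
    print(np.round(np.apply_along_axis(softmax, 1, logits), 2))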

Comment in

  • Learning optimal strategies in complex environments.
    Sejnowski TJ. Proc Natl Acad Sci U S A. 2010 Nov 23;107(47):20151-2. doi: 10.1073/pnas.1014954107. Epub 2010 Nov 15. PMID: 21078996. Free PMC article. No abstract available.


