Review

Habits, action sequences and reinforcement learning

Amir Dezfouli et al. Eur J Neurosci. 2012 Apr;35(7):1036-51.
doi: 10.1111/j.1460-9568.2012.08050.x.

Abstract

It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary, and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions.
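
To make the abstract's contrast concrete, here is a minimal sketch in Python (illustrative only; the two-state environment, learning rate, discount factor and state names are assumptions, not taken from the paper) of how a model-free controller caches action values within states while a model-based controller learns transitions and rewards and evaluates actions by planning over them.

```python
# Minimal sketch: model-free value caching vs. model-based planning.
# The environment and parameters are illustrative assumptions.
import random

STATES = ["S0", "S1"]
ACTIONS = ["press_lever", "enter_magazine"]

def step(state, action):
    """Toy deterministic environment: press the lever, then enter the magazine for reward."""
    if state == "S0" and action == "press_lever":
        return "S1", 0.0
    if state == "S1" and action == "enter_magazine":
        return "S0", 1.0
    return state, 0.0

# Model-free controller: caches a value for each state-action pair (Q-learning).
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def model_free_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Model-based controller: stores a (here deterministic) world model and plans over it.
model = {}  # (state, action) -> (next_state, reward)

def model_based_value(s, a, depth=3, gamma=0.95):
    if depth == 0 or (s, a) not in model:
        return 0.0
    s_next, r = model[(s, a)]
    return r + gamma * max(model_based_value(s_next, a2, depth - 1)
                           for a2 in ACTIONS)

state = "S0"
for _ in range(500):
    action = random.choice(ACTIONS)            # exploratory behavior for learning
    next_state, reward = step(state, action)
    model_free_update(state, action, reward, next_state)
    model[(state, action)] = (next_state, reward)
    state = next_state

# If the stored rewards in `model` are changed (e.g. outcome devaluation),
# model_based_value reflects the change on the next evaluation, whereas the
# cached Q table only changes through further experience.
```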

Figures

Figure 1
Four groups of rats (n=8) were trained to lever press for a 20% sucrose solution on random-interval schedules (RI1, RI15, RI30, RI60), with moderately trained rats allowed to earn 120 sucrose deliveries and overtrained rats 360 sucrose deliveries (the latter involving an additional 8 sessions of RI60 training with 30 sucrose deliveries per session). (A) For the devaluation assessment, half of each group was then sated either on the sucrose or on their maintenance chow before a 5-min extinction test was conducted on the levers. As shown in panel A, moderately trained rats showed a reinforcer devaluation effect; those sated on the sucrose outcome reduced performance on the lever relative to those sated on the chow. In contrast, the groups of overtrained rats did not differ in the test. Statistically, we found a training × devaluation interaction, F(1,28)=7.13, p<0.05, and a significant devaluation effect in the moderately trained condition, F(1,28)=9.1, p<0.05, but not in the overtrained condition (F<1). (B) For the contingency assessment, after devaluation all rats received a single session of retraining for 30 sucrose deliveries before the moderately trained and overtrained rats were randomly assigned to either an omission group or a yoked, non-contingent control group. During the contingency test, the sucrose outcome was no longer delivered contingent on lever pressing and was instead delivered on a fixed-time 10-s schedule. For rats in the omission groups, responses on the lever delayed the sucrose delivery by 10 s. Rats in the yoked groups received the sucrose at the same times as the omission group, except that no response contingency was in place. As panel B shows, rats exposed to the omission contingency in the moderately trained group suppressed lever-press performance relative to the non-contingent control, whereas those in the overtrained groups did not. Statistically, there was a training × degradation interaction, F(1,28)=5.1, p<0.05, and a significant degradation effect in the moderately trained condition, F(1,28)=7.8, p<0.05, but not in the overtrained condition (F<1).
Figure 2
(A) A closed-loop control system. After the controller executes an action, it receives cues regarding the new state of the environment and a reward. (B) An open-loop control system, in which the controller does not receive feedback from the environment.
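
As a toy illustration of the two control schemes in Figure 2 (function and argument names are invented for this sketch and are not from the paper), the closed-loop controller selects every action from the state fed back by the environment, whereas the open-loop controller runs a pre-specified sequence without consulting feedback.

```python
# Illustrative sketch of closed-loop vs. open-loop control (names are assumptions).

def closed_loop_control(env_step, policy, state, n_steps):
    """Select each action from the current, fed-back state."""
    for _ in range(n_steps):
        action = policy(state)              # feedback: state -> action
        state, reward = env_step(state, action)
    return state

def open_loop_control(env_step, action_sequence, state):
    """Execute a fixed action sequence; intervening states are ignored."""
    for action in action_sequence:          # no environmental feedback used
        state, reward = env_step(state, action)
    return state
```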
Figure 3
Environmental constraints on sequence formation. (A) An example of an environment in which action sequences will form. Action A1 leads to two different states with equal probability, in both of which action A2 is the best action, and thus the action sequence {A1A2} forms. (B) An example of an environment in which action sequences do not form. Action A1 leads to two different states with equal probability, and action A2 is the best action in only one of them. As a consequence, the action sequence {A1A2} does not form. (C) An example of an environment in which the process of sequence formation is non-trivial. Action A1 leads to two different states with equal probability; action A2 is the best action in one of them, whereas in the other it is not, although it is only slightly worse than the rival best action.
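
The constraint illustrated here can be stated compactly: a sequence {A1A2} is a candidate to form only when A2 is the best action in every state that A1 is likely to produce. The sketch below is an illustrative check of that condition; the data structures and probability threshold are assumptions, not the paper's algorithm.

```python
# Illustrative check of the sequence-formation condition from Figure 3.

def best_action(q_values, state):
    """q_values[state] maps each action to its estimated value."""
    return max(q_values[state], key=q_values[state].get)

def sequence_can_form(transitions, q_values, state, a1, a2, min_prob=0.05):
    """transitions[(state, a1)] maps successor states to their probabilities."""
    successors = transitions[(state, a1)]
    return all(best_action(q_values, s_next) == a2
               for s_next, p in successors.items() if p >= min_prob)

# Panel A: A2 is best in both successor states -> {A1A2} can form.
# Panel B: A2 is best in only one successor state -> {A1A2} does not form.
```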
Figure 4
Mixed model-based and sequence-based decision-making. At state S0, three actions are available: A1, A2 and A3, where A1 and A2 are primitive actions and A3 is a macro-action composed of the primitive actions M1–M4. If the macro-action is selected for execution at state S0, action control transfers to the sequence-based controller and actions M1–M4 are executed. After termination of the macro-action, control returns to model-based decision-making at state S3.
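
A minimal sketch of this control flow (function names are invented for the example): the model-based system chooses among primitive actions and macro-actions, and a chosen macro-action runs its primitive steps open-loop before control returns to the model-based system.

```python
# Illustrative sketch of mixed model-based / sequence-based control (Figure 4).

def run_mixed_controller(env_step, select_model_based, macros, state, n_decisions):
    for _ in range(n_decisions):
        choice = select_model_based(state)       # deliberative choice, e.g. at S0
        if choice in macros:
            for primitive in macros[choice]:     # open-loop execution of M1..M4
                state, _ = env_step(state, primitive)
        else:
            state, _ = env_step(state, choice)   # ordinary primitive action
        # control is back with the model-based system at the resulting state
    return state

# Example macro set, using the action names from the caption:
macros = {"A3": ["M1", "M2", "M3", "M4"]}
```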
Figure 5
The dynamics of sequence learning in sequential and random trials of the serial reaction time task (SRTT). (A, B) As learning progresses, the average reward that the agent gains increases, and with it the cost of waiting for model-based action selection. At the same time, the cost of sequence-based action selection decreases (top panel), indicating that the agent has discovered the correct action sequences. Whenever this cost falls below the benefit, a new action sequence forms (bottom panel). The abscissa shows the number of action selections. (C) Reaction times decrease in sequential trials as a result of sequence formation, but they remain constant in the random trials of the SRTT because, as panel D shows, no action sequences form. Data reported are means over 10 runs.
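
The rule summarized in this caption can be sketched as a simple cost-benefit comparison. The estimators below are placeholders chosen for illustration, not the paper's equations.

```python
# Illustrative cost-benefit rule for sequence formation (Figure 5).

def should_form_sequence(avg_reward_rate, deliberation_time,
                         p_wrong_second_action, loss_if_wrong):
    # Benefit of acting open-loop: reward not forgone while deliberating.
    benefit = avg_reward_rate * deliberation_time
    # Cost of acting open-loop: occasionally executing the wrong next action.
    cost = p_wrong_second_action * loss_if_wrong
    return cost < benefit

# In sequential SRTT trials p_wrong_second_action shrinks with training, so the
# rule eventually triggers; in random trials it stays high and no sequence forms.
```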
Figure 6
Formal representation of instrumental conditioning tasks. (A) Instrumental learning: by taking the press-lever (PL) action and then the enter-magazine (EM) action, the agent earns a reward of magnitude one. Taking the EM action in state S2 or the PL action in state S1 leaves the agent in the same state (not shown in the figure). (B, C) Reinforcer devaluation: the agent learns that the reward at state S0 is devalued and is then tested in extinction, during which no reward is delivered. (D) Non-contingent training: unlike in panel A, the reward is not contingent on the PL action, and the agent can gain the reward simply by entering the magazine. (E) Omission training: taking the PL action delays reward delivery, and the agent must wait 6 s before it can gain the reward by entering the magazine.
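
For concreteness, panel A can be written as a small tabular MDP. The state names below are illustrative stand-ins for the S-states in the figure; the essential structure is that PL followed by EM earns a reward of magnitude one and the remaining state-action pairs are self-loops.

```python
# Illustrative tabular MDP for the instrumental learning task (Figure 6A).
transitions = {
    # (state, action): (next_state, reward)
    ("lever", "PL"):    ("magazine", 0.0),  # pressing the lever arms the reward
    ("magazine", "EM"): ("lever", 1.0),     # entering the magazine collects it
    ("lever", "EM"):    ("lever", 0.0),     # premature magazine entry: no change
    ("magazine", "PL"): ("magazine", 0.0),  # extra presses: no change
}
```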
Figure 7
Sensitivity of the model to reinforcer devaluation and contingency manipulations before and after sequence formation. (A) In the moderate training condition, actions are selected on the basis of model-based evaluation (left panel) but, after extended training, selection of the PL action is potentiated by the preceding action (here EM) (right panel). (B) After the devaluation phase (marked by the solid line), the probability of pressing the lever decreases immediately if the model is moderately trained. The abscissa shows the number of action selections. (C) After the devaluation phase, behavior does not adapt until the action sequence decomposes and control returns to the model-based method. (D) In a moderately trained model, the probability of selecting the PL action starts to decrease under contingency degradation, although the rate of decrease is greater in the case of omission training. (E) When training is extensive, behavior does not adjust, and the non-contingent and omission groups perform at the same rate until the sequence decomposes. Data reported are means over 3000 runs.
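
A toy sketch of why responding persists after devaluation in panels C and E (illustrative only, not the paper's simulation): the sequence is executed on the basis of a value cached before devaluation, whereas model-based evaluation reflects the new reward value as soon as it is consulted.

```python
# Illustrative contrast between re-planned and cached values under devaluation.

reward_value = {"sucrose": 1.0}

def model_based_choice_value(reward_value):
    # Re-plans with the current reward function, so devaluation shows at once.
    return reward_value["sucrose"]

# Value cached when the sequence formed, before devaluation.
cached_sequence_value = model_based_choice_value(reward_value)

reward_value["sucrose"] = 0.0   # reinforcer devaluation

print(model_based_choice_value(reward_value))  # 0.0 -> immediate adjustment
print(cached_sequence_value)                   # 1.0 -> habitual persistence
```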
