Review

The ubiquity of model-based reinforcement learning

Bradley B Doll et al. Curr Opin Neurobiol. 2012 Dec;22(6):1075-81. doi: 10.1016/j.conb.2012.08.003. Epub 2012 Sep 6.

Abstract

The reward prediction error (RPE) theory of dopamine (DA) function has enjoyed great success in the neuroscience of learning and decision-making. This theory is derived from model-free reinforcement learning (RL), in which choices are made simply on the basis of previously realized rewards. Recently, attention has turned to correlates of more flexible, albeit computationally complex, model-based methods in the brain. These methods are distinguished from model-free learning by their evaluation of candidate actions using expected future outcomes according to a world model. Puzzlingly, signatures from these computations seem to be pervasive in the very same regions previously thought to support model-free learning. Here, we review recent behavioral and neural evidence about these two systems, in an attempt to reconcile their enigmatic cohabitation in the brain.


Figures

Figure 1
Sequential task dissociating model-based from model-free learning. (a) A two-step decision-making task [33], in which each of two options (A1, A2) at a start state leads preferentially to one of two subsequent states (A1 to B, A2 to C), where choices (B1 vs. B2 or C1 vs. C2) are rewarded stochastically with money. (b, c) Model-free and model-based RL can be distinguished by the pattern of staying vs. switching of a top-level choice following bottom-level winnings. A model-free learner like TD(1) (b) tends to repeat a rewarded action without regard to whether the reward occurred after a common transition (blue, like A1 to B) or a rare one (red). A model-based learner (c) evaluates top-level actions using a model of their likely consequences, so that reward following a rare transition (e.g., A1 to C) actually increases the value of the unchosen option (A2) and thus predicts switching. Human subjects in [33] exhibited a mixture of both effects.
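The opposing predictions in panels (b) and (c) can be sketched in a one-trial toy calculation (not the authors' code; the 0.7/0.3 transition probabilities and the learning rate are illustrative assumptions):

```python
# Why TD(1) and model-based RL disagree after a rewarded RARE transition
# in the two-step task of Figure 1. All numbers are illustrative.

ALPHA = 0.5                                   # learning rate (assumed)
P = {('A1', 'B'): 0.7, ('A1', 'C'): 0.3,      # transition model:
     ('A2', 'B'): 0.3, ('A2', 'C'): 0.7}      # common = 0.7, rare = 0.3

# Trial: subject chose A1, experienced the rare transition to C, and won.
chosen, state, reward = 'A1', 'C', 1.0

# --- Model-free TD(1): credit the chosen top-level action directly ---
q_mf = {'A1': 0.0, 'A2': 0.0}
q_mf[chosen] += ALPHA * (reward - q_mf[chosen])
# Reward raises the chosen action's value regardless of transition type,
# so TD(1) predicts staying with A1.
assert q_mf['A1'] > q_mf['A2']

# --- Model-based RL: evaluate top-level actions through the world model ---
v = {'B': 0.0, 'C': 0.0}
v[state] += ALPHA * (reward - v[state])       # learn the bottom-level value
q_mb = {a: P[(a, 'B')] * v['B'] + P[(a, 'C')] * v['C'] for a in ('A1', 'A2')}
# State C is reached more often from A2, so the UNCHOSEN option now looks
# better: the model-based learner predicts switching to A2.
assert q_mb['A2'] > q_mb['A1']
```

With these numbers, q_mf favors A1 (0.5 vs. 0) while q_mb favors A2 (0.35 vs. 0.15), matching the stay/switch dissociation the figure describes.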
Figure 2
Learning through value generalization (left) and model-based forward planning (right). In a reversal learning task (left), the rat has just taken an action (lever-press) and received no reward, and so updates its internal choice value representation to decrement the chosen option's value. Because the unchosen option's value is represented on the same scale, inverted, it is implicitly incremented as well. Implemented this way, learning relies on model-free updating over a modified input, and does not involve explicitly constructing or evaluating a forward model of action consequences. In a model-based RL approach to a maze task (right), the rat has an internal representation of the sequential structure of the maze, and uses it to evaluate a candidate route to the reward.
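The value-generalization scheme in Figure 2 (left) can be sketched as a delta-rule update on a single shared value axis (a minimal illustration under assumed parameters, not the paper's implementation):

```python
# Value generalization as model-free updating over a modified input:
# the two options share one signed value axis, so updating the chosen
# option implicitly moves the unchosen option in the opposite direction.

ALPHA = 0.5   # learning rate (assumed)

def update(v, chose_press, reward):
    """Delta-rule update on the shared axis v = value(press) - value(withhold)."""
    sign = 1.0 if chose_press else -1.0
    prediction = sign * v                 # chosen option's value, read off the axis
    delta = reward - prediction           # model-free prediction error
    return v + ALPHA * sign * delta

v = 0.4                                   # press currently valued (assumed prior learning)
old_press, old_withhold = v, -v

# The rat presses the lever and receives no reward:
v = update(v, chose_press=True, reward=0.0)

assert v < old_press                      # chosen option's value decremented
assert -v > old_withhold                  # unchosen option implicitly incremented
```

A single update thus produces the mirrored value change the caption describes, with no forward model of action consequences anywhere in the computation.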

References

    1. Barto AG. Adaptive critics and the basal ganglia. In: Houk JC, Davis JL, Beiser DG, editors. Models of information processing in the basal ganglia. Cambridge, MA: MIT Press; 1995. pp. 215–232.
    2. Montague PR, Dayan P, Sejnowski TJ. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci. 1996;16(5):1936–1947.
    3. Thorndike EL. Animal intelligence: An experimental study of the associative processes in animals. Psychological Review Monograph Supplement. 1898;2(4):1–8.
    4. Tolman EC. Cognitive maps in rats and men. Psychol Rev. 1948;55:189–208.
    5. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8(12):1704–1711.
