Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Daniel Bennett et al. Curr Opin Behav Sci. 2021 Oct;41:114-121.
doi: 10.1016/j.cobeha.2021.04.020. Epub 2021 May 28.

Abstract

Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.

Keywords: computational modelling; decision-making; policy gradient; reinforcement learning; value.
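As a concrete illustration of the value-based side of this contrast, a minimal Python sketch of a Q-learning bandit agent is given below. This is not code from the paper; the two-action setting, learning rate, inverse temperature, and payoff probabilities are illustrative assumptions. Action-values Q are updated from reward prediction errors, and choices are made by a softmax comparison of the learned values.

    import numpy as np

    rng = np.random.default_rng(0)

    n_actions = 2
    alpha = 0.1                          # learning rate (illustrative value)
    beta = 5.0                           # softmax inverse temperature (illustrative value)
    Q = np.zeros(n_actions)              # learned action-values
    reward_prob = np.array([0.8, 0.2])   # hypothetical bandit payoff probabilities

    for t in range(1000):
        # Policy is derived from the learned values via a softmax comparison
        p = np.exp(beta * Q)
        p /= p.sum()
        a = rng.choice(n_actions, p=p)
        r = float(rng.random() < reward_prob[a])   # binary reward
        # Prediction-error-driven update to the chosen action's value
        Q[a] += alpha * (r - Q[a])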

Figures

Figure 1:
Update schematics for example value-based and policy-gradient RL algorithms. Shaded diamond nodes denote observable variables, unshaded circular nodes denote latent variables that are internal to the RL agent, and arrows denote dependencies. For simplicity, in these algorithms we do not show the environmental state, which would be an additional (potentially partially) observable variable. A: in a value-based RL algorithm (such as the Q-learning model presented here), actions (a, chosen from a discrete set A) are a product of the agent’s policy π, which in turn is determined (dotted cyan arrow) by the learned action-values (Q). The update rule for action-values (dashed green arrow) depends on the action-values and received reward (r) at the previous timestep, and only indirectly on the policy. This algorithm has two adjustable parameters: the learning rate α and the softmax inverse temperature β. B: a policy-gradient algorithm (such as the gradient-bandit algorithm presented here; see [13]) selects actions according to a parameterised policy πθ, and updates the parameters θ of this policy directly (dashed magenta arrow; in the gradient-bandit algorithm, θ is a vector of action preferences), without the intermediate step of learning action-values. In the policy-gradient algorithm, by contrast with the value-based algorithm, the size of the update to θ depends more directly on the current policy, since the size of the update to each action preference is scaled by the probability of that action under the policy.
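For comparison with the value-based sketch above, the following is a minimal sketch of the gradient-bandit update described in panel B. Again, this is illustrative rather than the paper's code; the running-average reward baseline follows the standard gradient-bandit formulation and is an assumption here. The action preferences θ parameterise a softmax policy and are updated directly, with each preference's change scaled by that action's probability under the current policy, and no action-values are learned.

    import numpy as np

    rng = np.random.default_rng(0)

    n_actions = 2
    alpha = 0.1                          # step size for the preference update (illustrative)
    theta = np.zeros(n_actions)          # action preferences (policy parameters)
    baseline = 0.0                       # running-average reward baseline (assumed)
    reward_prob = np.array([0.8, 0.2])   # hypothetical bandit payoff probabilities

    for t in range(1000):
        # Parameterised softmax policy over the action preferences
        pi = np.exp(theta)
        pi /= pi.sum()
        a = rng.choice(n_actions, p=pi)
        r = float(rng.random() < reward_prob[a])   # binary reward
        # Direct policy update: each preference moves in proportion to the
        # action's probability under the current policy; no values are learned
        indicator = np.zeros(n_actions)
        indicator[a] = 1.0
        theta += alpha * (r - baseline) * (indicator - pi)
        baseline += (r - baseline) / (t + 1)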

References

    1. O’Doherty JP, The problem with value, Neuroscience & Biobehavioral Reviews 43 (2014) 259–268. - PMC - PubMed
    2. Miller KJ, Shenhav A, Ludvig EA, Habits without values, Psychological Review 126 (2019) 292–311. - PMC - PubMed
       * This paper presents evidence that habit formation—a process previously linked with model-free learning of action-values—may instead be produced by a value-free process in which choosing an action directly strengthens its future choice probability. Although this is not strictly speaking a policy-gradient model, it nevertheless points to the feasibility of explaining operant behavior in terms of modulation of a policy, without recourse to the explanatory construct of value.
    3. Juechems K, Summerfield C, Where does value come from?, Trends in Cognitive Sciences 23 (2019) 836–850. - PubMed
    4. Suri G, Gross JJ, McClelland JL, Value-based decision making: An interactive activation perspective, Psychological Review 127 (2020) 153. - PubMed
    5. Hayden B, Niv Y, The case against economic values in the brain, 2020. Preprint hosted at PsyArXiv. - PubMed
