PLoS Comput Biol. 2022 Jul 21;18(7):e1010350. doi: 10.1371/journal.pcbi.1010350. eCollection 2022 Jul.

Asymmetric and adaptive reward coding via normalized reinforcement learning


Kenway Louie. PLoS Comput Biol.

Abstract

Learning is widely modeled in psychology, neuroscience, and computer science by prediction error-guided reinforcement learning (RL) algorithms. While standard RL assumes linear reward functions, reward-related neural activity is a saturating, nonlinear function of reward; however, the computational and behavioral implications of nonlinear RL are unknown. Here, we show that nonlinear RL incorporating the canonical divisive normalization computation introduces an intrinsic and tunable asymmetry in prediction error coding. At the behavioral level, this asymmetry explains empirical variability in risk preferences typically attributed to asymmetric learning rates. At the neural level, diversity in asymmetries provides a computational mechanism for recently proposed theories of distributional RL, allowing the brain to learn the full probability distribution of future rewards. This behavioral and computational flexibility argues for an incorporation of biologically valid value functions in computational models of learning and decision-making.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Normalized reinforcement learning model.
(a) Comparison of standard reinforcement learning (RL) and normalized reinforcement learning (NRL) models. RL and NRL differ in how external rewards are transformed by the reward coding function f(Rt) prior to learning internal value estimates. Standard RL uses a linear reward function, while NRL uses a divisively normalized representation. (b) Learned value functions under RL and NRL. Left, dynamic value estimates during learning. Right, steady state value estimates. Simulations were performed for each of seven rewards, with additive zero-mean Gaussian noise (learning rate η = 0.1). In contrast to RL, NRL algorithms learn values that are a nonlinear function of external rewards (example NRL simulation parameters: σ = 50, n = 2). (c) Convexity and concavity in NRL value functions. Top, NRL value functions with different exponents (fixed σ = 50 A.U.). Bottom, second derivative of value functions. Dots show inflection points between convex and concave value regimes. (d) Parametric control of NRL value curvature. NRL value function (top) and second derivative (bottom) for different σ values (fixed n = 2).
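The learning scheme in Fig 1 can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the divisively normalized reward function is assumed to take the standard Naka-Rushton form f(R) = R^n / (σ^n + R^n) with the semisaturation σ and exponent n named in the caption, and all simulation settings (trial count, noise) are assumptions.

```python
import numpy as np

def normalize(R, sigma=50.0, n=2.0):
    """Assumed divisive-normalization reward coding: f(R) = R^n / (sigma^n + R^n)."""
    R = max(R, 0.0)
    return R ** n / (sigma ** n + R ** n)

def nrl_value(reward, sigma=50.0, n=2.0, eta=0.1, trials=2000, noise_sd=1.0, seed=0):
    """Delta-rule learning of the value of a single noisy reward, as in Fig 1b."""
    rng = np.random.default_rng(seed)
    V = 0.0
    for _ in range(trials):
        R = reward + rng.normal(0.0, noise_sd)   # additive zero-mean Gaussian noise
        V += eta * (normalize(R, sigma, n) - V)  # prediction-error update on f(R)
    return V
```

Under this assumed form, steady-state value estimates saturate with reward magnitude rather than growing linearly, reproducing the qualitative nonlinearity shown in Fig 1b.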
Fig 2. Parametric control of prediction error asymmetry.
(a) Examples of variable reward prediction error (RPE) asymmetry. Each panel shows NRL responses for reward inputs (R = 50 A.U.) with uniformly distributed noise. Lines show piecewise linear regression fits for negative (red) and positive (green) reward errors. (b) The NRL semisaturation term governs the degree and direction of RPE asymmetry. NRL RPE asymmetry is biased towards negative RPEs at low σ and positive RPEs at high σ.
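The Fig 2 analysis can be sketched as follows: fit separate linear slopes to NRL prediction errors for negative versus positive reward deviations and compare them. The reward coding function, noise width, and sample sizes below are assumptions carried over from the Fig 1 sketch, not values from the paper.

```python
import numpy as np

def rpe_asymmetry(sigma, n=2.0, R0=50.0, half_width=20.0, m=4000, seed=1):
    """Ratio of positive- to negative-RPE slopes for an assumed normalized code."""
    rng = np.random.default_rng(seed)
    f = lambda R: R ** n / (sigma ** n + R ** n)  # assumed Naka-Rushton form
    V = f(R0)                                     # converged estimate for mean reward
    R = R0 + rng.uniform(-half_width, half_width, m)  # uniformly distributed noise
    delta = f(R) - V                              # NRL prediction errors
    err = R - R0                                  # signed reward deviation
    slope_neg = np.polyfit(err[err < 0], delta[err < 0], 1)[0]
    slope_pos = np.polyfit(err[err > 0], delta[err > 0], 1)[0]
    return slope_pos / slope_neg  # > 1: positive-RPE bias; < 1: negative-RPE bias
```

Because the assumed value function is concave around R = 50 at low σ and convex at high σ, this ratio moves from below one to above one as σ grows, matching the direction of the asymmetry described in panel (b).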
Fig 3. NRL RPE asymmetry governs the degree of risk preference in reward learning.
(a) Examples of risk averse and risk seeking NRL agent behavior. Left, behavior in a task involving choices between a certain (100% chance of 20 A.U.) and a risky (50% chance of 0 or 40 A.U.) option. Blue lines, behavior of the generative NRL agent. Black lines, behavior of best fitting linear RL model with asymmetric learning rates. Right, apparent learning rates for negative and positive RPEs in linear RL model. Example risk averse σ = 20 (top); example risk seeking σ = 60 (bottom). (b) Apparent relationship between risk preference and asymmetric learning rates under assumption of linear reward coding. Behavioral risk aversion (percent choice of certain option) and learning rate asymmetry (η− − η+)/(η− + η+) defined as in previous work [26]. (c) Risk preference depends on RPE asymmetry in generative NRL model. Degree of behavioral risk aversion controlled by NRL semisaturation parameter.
Fig 4. NRL RPE asymmetry provides a computational basis for distributional reinforcement learning.
(a) Variable NRL RPE response asymmetries in a probabilistic reward environment. Examples show NRL agents with stronger negative (blue) and positive (red) RPE asymmetry. Note that these two agents exhibit different reversal points in the same reward environment (rewards = {0.1, 0.3, 1.2, 2.5, 5, 10, 20 μl}, as in previous work [35]). Triangles denote the true average reward (black) and estimated average reward learned by pessimistic (blue) and optimistic (red) NRL agents. (b) Learned reversal points vary systematically with NRL parameterization. RPE responses and reversal points quantified for varying σ parameters. (c) Reversal points depend on NRL RPE asymmetry. Plots show NRL responses normalized by negative RPE slope and aligned to individual reversal points. As in empirical dopamine data, low (high) reversal points arise from stronger negative (positive) RPE asymmetry. (d) NRL asymmetry and learning match empirical dopamine data. Blue, dopamine neurons recorded in stochastic reward environment [35]; black, heterogeneous NRL agents in identical reward environment. Asymmetry is defined as in previous work as a function of positive (α+) and negative (α-) RPE coding slopes. (e) A population of NRL agents learns the distribution of experienced rewards. 40 NRL agents were simulated in four different reward environments: symmetric, right-skewed, left-skewed, and multimodal. Each panel plots the ground truth (gray) and decoded (blue) probability densities, with samples smoothed by kernel density estimation. Distribution decoding was performed via an imputation strategy, treating the NRL reversal points and response asymmetries as expectiles.
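The distributional-RL link in Fig 4 rests on a standard result: a delta rule with asymmetric slopes α+ and α− converges to the expectile of the reward distribution at level τ = α+/(α+ + α−), so agents with diverse asymmetries acquire different reversal points that together tile the reward distribution. The sketch below uses the reward set from the caption; the step sizes and asymmetry values are illustrative assumptions, not fitted parameters.

```python
import numpy as np

def learn_reversal_point(alpha_plus, alpha_minus, rewards,
                         lr=0.01, trials=100_000, seed=3):
    """Asymmetric delta rule; converges near the expectile at
    tau = alpha_plus / (alpha_plus + alpha_minus)."""
    rng = np.random.default_rng(seed)
    V = float(np.mean(rewards))
    for _ in range(trials):
        R = rng.choice(rewards)                                # sample from environment
        delta = R - V
        V += lr * (alpha_plus if delta > 0 else alpha_minus) * delta
    return V

rewards = [0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0]  # reward set from Fig 4a
pessimist = learn_reversal_point(alpha_plus=0.3, alpha_minus=0.7, rewards=rewards)
optimist = learn_reversal_point(alpha_plus=0.7, alpha_minus=0.3, rewards=rewards)
```

The pessimistic agent settles below the mean reward and the optimistic agent above it, mirroring the blue and red reversal points in panel (a); reading such reversal points off a population of agents is what the expectile-imputation decoding in panel (e) exploits.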

References

    1. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.
    2. Botvinick MM, Niv Y, Barto AG. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition. 2009;113(3):262–80. doi: 10.1016/j.cognition.2008.08.011
    3. Dolan RJ, Dayan P. Goals and habits in the brain. Neuron. 2013;80(2):312–25. doi: 10.1016/j.neuron.2013.09.007
    4. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33. doi: 10.1038/nature14236
    5. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go without human knowledge. Nature. 2017;550(7676):354–9. doi: 10.1038/nature24270