PLoS Comput Biol. 2022 Jul 21;18(7):e1010350. doi: 10.1371/journal.pcbi.1010350. eCollection 2022 Jul.

Asymmetric and adaptive reward coding via normalized reinforcement learning


Kenway Louie. PLoS Comput Biol.

Abstract

Learning is widely modeled in psychology, neuroscience, and computer science by prediction error-guided reinforcement learning (RL) algorithms. While standard RL assumes linear reward functions, reward-related neural activity is a saturating, nonlinear function of reward; however, the computational and behavioral implications of nonlinear RL are unknown. Here, we show that nonlinear RL incorporating the canonical divisive normalization computation introduces an intrinsic and tunable asymmetry in prediction error coding. At the behavioral level, this asymmetry explains empirical variability in risk preferences typically attributed to asymmetric learning rates. At the neural level, diversity in asymmetries provides a computational mechanism for recently proposed theories of distributional RL, allowing the brain to learn the full probability distribution of future rewards. This behavioral and computational flexibility argues for an incorporation of biologically valid value functions in computational models of learning and decision-making.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Normalized reinforcement learning model.
(a) Comparison of standard reinforcement learning (RL) and normalized reinforcement learning (NRL) models. RL and NRL differ in how external rewards are transformed by the reward coding function f(Rt) prior to learning internal value estimates. Standard RL uses a linear reward function, while NRL uses a divisively normalized representation. (b) Learned value functions under RL and NRL. Left, dynamic value estimates during learning. Right, steady state value estimates. Simulations were performed for each of seven rewards, with additive zero-mean Gaussian noise (learning rate η = 0.1). In contrast to RL, NRL algorithms learn values that are a nonlinear function of external rewards (example NRL simulation parameters: σ = 50, n = 2). (c) Convexity and concavity in NRL value functions. Top, NRL value functions with different exponents (fixed σ = 50 A.U.). Bottom, second derivative of value functions. Dots show inflection points between convex and concave value regimes. (d) Parametric control of NRL value curvature. NRL value function (top) and second derivative (bottom) for different σ values (fixed n = 2).
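The learning scheme in Fig 1 can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the divisively normalized reward function is assumed to take the standard Naka-Rushton form f(R) = R^n / (σ^n + R^n) with the semisaturation σ and exponent n named in the caption, and all simulation settings (trial count, noise) are assumptions.

```python
import numpy as np

def normalize(R, sigma=50.0, n=2.0):
    """Assumed divisive-normalization reward coding: f(R) = R^n / (sigma^n + R^n)."""
    R = max(R, 0.0)
    return R ** n / (sigma ** n + R ** n)

def nrl_value(reward, sigma=50.0, n=2.0, eta=0.1, trials=2000, noise_sd=1.0, seed=0):
    """Delta-rule learning of the value of a single noisy reward, as in Fig 1b."""
    rng = np.random.default_rng(seed)
    V = 0.0
    for _ in range(trials):
        R = reward + rng.normal(0.0, noise_sd)   # additive zero-mean Gaussian noise
        V += eta * (normalize(R, sigma, n) - V)  # prediction-error update on f(R)
    return V
```

Under this assumed form, steady-state value estimates saturate with reward magnitude rather than growing linearly, reproducing the qualitative nonlinearity shown in Fig 1b.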
Fig 2. Parametric control of prediction error asymmetry.
(a) Examples of variable reward prediction error (RPE) asymmetry. Each panel shows NRL responses for reward inputs (R = 50 A.U.) with uniformly distributed noise. Lines show piecewise linear regression fits for negative (red) and positive (green) reward errors. (b) The NRL semisaturation term governs the degree and direction of RPE asymmetry. NRL RPE asymmetry is biased towards negative RPEs at low σ and positive RPEs at high σ.
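The Fig 2 analysis can be sketched as follows: fit separate linear slopes to NRL prediction errors for negative versus positive reward deviations and compare them. The reward coding function, noise width, and sample sizes below are assumptions carried over from the Fig 1 sketch, not values from the paper.

```python
import numpy as np

def rpe_asymmetry(sigma, n=2.0, R0=50.0, half_width=20.0, m=4000, seed=1):
    """Ratio of positive- to negative-RPE slopes for an assumed normalized code."""
    rng = np.random.default_rng(seed)
    f = lambda R: R ** n / (sigma ** n + R ** n)  # assumed Naka-Rushton form
    V = f(R0)                                     # converged estimate for mean reward
    R = R0 + rng.uniform(-half_width, half_width, m)  # uniformly distributed noise
    delta = f(R) - V                              # NRL prediction errors
    err = R - R0                                  # signed reward deviation
    slope_neg = np.polyfit(err[err < 0], delta[err < 0], 1)[0]
    slope_pos = np.polyfit(err[err > 0], delta[err > 0], 1)[0]
    return slope_pos / slope_neg  # > 1: positive-RPE bias; < 1: negative-RPE bias
```

Because the assumed value function is concave around R = 50 at low σ and convex at high σ, this ratio moves from below one to above one as σ grows, matching the direction of the asymmetry described in panel (b).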
Fig 3. NRL RPE asymmetry governs the degree of risk preference in reward learning.
(a) Examples of risk averse and risk seeking NRL agent behavior. Left, behavior in a task involving choices between a certain (100% chance of 20 A.U.) and a risky (50% chance of 0 or 40 A.U.) option. Blue lines, behavior of the generative NRL agent. Black lines, behavior of best fitting linear RL model with asymmetric learning rates. Right, apparent learning rates for negative and positive RPEs in linear RL model. Example risk averse σ = 20 (top); example risk seeking σ = 60 (bottom). (b) Apparent relationship between risk preference and asymmetric learning rates under assumption of linear reward coding. Behavioral risk aversion (percent choice of certain option) and learning rate asymmetry (η− − η+)/(η− + η+) defined as in previous work [26]. (c) Risk preference depends on RPE asymmetry in generative NRL model. Degree of behavioral risk aversion controlled by NRL semisaturation parameter.
Fig 4. NRL RPE asymmetry provides a computational basis for distributional reinforcement learning.
(a) Variable NRL RPE response asymmetries in a probabilistic reward environment. Examples show NRL agents with stronger negative (blue) and positive (red) RPE asymmetry. Note that these two agents exhibit different reversal points in the same reward environment (rewards = {0.1, 0.3, 1.2, 2.5, 5, 10, 20 μl}, as in previous work [35]). Triangles denote the true average reward (black) and estimated average reward learned by pessimistic (blue) and optimistic (red) NRL agents. (b) Learned reversal points vary systematically with NRL parameterization. RPE responses and reversal points quantified for varying σ parameters. (c) Reversal points depend on NRL RPE asymmetry. Plots show NRL responses normalized by negative RPE slope and aligned to individual reversal points. As in empirical dopamine data, low (high) reversal points arise from stronger negative (positive) RPE asymmetry. (d) NRL asymmetry and learning match empirical dopamine data. Blue, dopamine neurons recorded in stochastic reward environment [35]; black, heterogeneous NRL agents in identical reward environment. Asymmetry is defined as in previous work as a function of positive (α+) and negative (α-) RPE coding slopes. (e) A population of NRL agents learns the distribution of experienced rewards. 40 NRL agents were simulated in four different reward environments: symmetric, right-skewed, left-skewed, and multimodal. Each panel plots the ground truth (gray) and decoded (blue) probability densities, with samples smoothed by kernel density estimation. Distribution decoding was performed via an imputation strategy, treating the NRL reversal points and response asymmetries as expectiles.
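The distributional-RL link in Fig 4 rests on a standard result: a delta rule with asymmetric slopes α+ and α− converges to the expectile of the reward distribution at level τ = α+/(α+ + α−), so agents with diverse asymmetries acquire different reversal points that together tile the reward distribution. The sketch below uses the reward set from the caption; the step sizes and asymmetry values are illustrative assumptions, not fitted parameters.

```python
import numpy as np

def learn_reversal_point(alpha_plus, alpha_minus, rewards,
                         lr=0.01, trials=100_000, seed=3):
    """Asymmetric delta rule; converges near the expectile at
    tau = alpha_plus / (alpha_plus + alpha_minus)."""
    rng = np.random.default_rng(seed)
    V = float(np.mean(rewards))
    for _ in range(trials):
        R = rng.choice(rewards)                                # sample from environment
        delta = R - V
        V += lr * (alpha_plus if delta > 0 else alpha_minus) * delta
    return V

rewards = [0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0]  # reward set from Fig 4a
pessimist = learn_reversal_point(alpha_plus=0.3, alpha_minus=0.7, rewards=rewards)
optimist = learn_reversal_point(alpha_plus=0.7, alpha_minus=0.3, rewards=rewards)
```

The pessimistic agent settles below the mean reward and the optimistic agent above it, mirroring the blue and red reversal points in panel (a); reading such reversal points off a population of agents is what the expectile-imputation decoding in panel (e) exploits.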

References

    1. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.
    2. Botvinick MM, Niv Y, Barto AG. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition. 2009;113(3):262–80. doi: 10.1016/j.cognition.2008.08.011
    3. Dolan RJ, Dayan P. Goals and habits in the brain. Neuron. 2013;80(2):312–25. doi: 10.1016/j.neuron.2013.09.007
    4. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33. doi: 10.1038/nature14236
    5. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go without human knowledge. Nature. 2017;550(7676):354–9. doi: 10.1038/nature24270