Nutrient-Sensitive Reinforcement Learning in Monkeys

Fei-Yang Huang et al.

J Neurosci. 2023 Mar 8;43(10):1714-1730. doi: 10.1523/JNEUROSCI.0752-22.2022. Epub 2023 Jan 20.
Abstract

In reinforcement learning (RL), animals choose by assigning values to options and learn by updating these values from reward outcomes. This framework has been instrumental in identifying fundamental learning variables and their neuronal implementations. However, canonical RL models do not explain how reward values are constructed from biologically critical intrinsic reward components, such as nutrients. From an ecological perspective, animals should adapt their foraging choices in dynamic environments to acquire nutrients that are essential for survival. Here, to advance the biological and ecological validity of RL models, we investigated how (male) monkeys adapt their choices to obtain preferred nutrient rewards under varying reward probabilities. We found that the nutrient composition of rewards strongly influenced learning and choices. Preferences of the animals for specific nutrients (sugar, fat) affected how they adapted to changing reward probabilities; the history of recent rewards influenced choices of the monkeys more strongly if these rewards contained their preferred nutrients (nutrient-specific reward history). The monkeys also chose preferred nutrients even when they were associated with lower reward probability. A nutrient-sensitive RL model captured these processes; it updated the values of individual sugar and fat components of expected rewards based on experience and integrated them into subjective values that explained the choices of the monkeys. Nutrient-specific reward prediction errors guided this value-updating process. Our results identify nutrients as important reward components that guide learning and choice by influencing the subjective value of choice options. Extending RL models with nutrient-value functions may enhance their biological validity and uncover nutrient-specific learning and decision variables.

SIGNIFICANCE STATEMENT: RL is an influential framework that formalizes how animals learn from experienced rewards. Although reward is a foundational concept in RL theory, canonical RL models cannot explain how learning depends on specific reward properties, such as nutrients. Intuitively, learning should be sensitive to the nutrient components of the reward to benefit health and survival. Here, we show that the nutrient (fat, sugar) composition of rewards affects how the monkeys choose and learn in an RL paradigm and that key learning variables, including reward history and reward prediction error, should be modified with nutrient-specific components to account for the choice behavior observed in the monkeys. By incorporating biologically critical nutrient rewards into the RL framework, our findings help advance the ecological validity of RL models.
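The nutrient-sensitive value construction summarized above can be made concrete with a short sketch. The Python below is an illustrative reconstruction, not the authors' code; the parameter values and names (v_fat, v_sugar, alpha, beta) and the multiplicative value rule are assumptions for exposition.

```python
import numpy as np

# Minimal sketch of a nutrient-sensitive RL agent (illustrative reconstruction;
# parameter names, values, and the multiplicative rule are assumptions).
rng = np.random.default_rng(0)

v_fat, v_sugar = 1.5, 2.0   # assumed subjective nutrient weights (baseline = 1)
alpha, beta = 0.3, 5.0      # learning rate and softmax inverse temperature

# Four liquids in a 2 x 2 fat x sugar design: (fat level, sugar level).
rewards = {"LFLS": (0, 0), "HFLS": (1, 0), "LFHS": (0, 1), "HFHS": (1, 1)}
# Subjective reward value: nutrient weights multiply the baseline value.
subj_value = {k: (v_fat ** f) * (v_sugar ** s) for k, (f, s) in rewards.items()}

Q = dict.fromkeys(rewards, 0.0)  # learned option values

def choose(opt_a, opt_b):
    """Softmax choice between the two offered options."""
    p_a = 1.0 / (1.0 + np.exp(-beta * (Q[opt_a] - Q[opt_b])))
    return opt_a if rng.random() < p_a else opt_b

def update(chosen, rewarded):
    """Prediction-error update; the outcome is scaled by subjective nutrient value."""
    outcome = subj_value[chosen] if rewarded else 0.0
    Q[chosen] += alpha * (outcome - Q[chosen])

# Toy usage: 200 trials in which HFHS pays the large reward 80% of the time.
for _ in range(200):
    pick = choose("HFHS", "LFLS")
    update(pick, rewarded=rng.random() < (0.8 if pick == "HFHS" else 0.2))
```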

Keywords: food; learning; nutrients; preference; reward; reward prediction error.


Figures

Figure 1.
Dynamic foraging task with nutrient-defined rewards. A, Task structure. In each trial, the monkeys were first sequentially presented with two visual cues randomly drawn from a set of four cues and then made a left or right touch choice between these two cues when they were simultaneously presented. Following the touch choice, the animals received a large amount (0.5 ml) or a small amount (0.3 ml) of the associated liquid reward depending on a prespecified reward probability (P). B, Reward design. Four types of liquids with 2 × 2 factorial fat and sugar levels were offered in each session: LFLS (yellow), HFLS (green), LFHS (blue), and HFHS (red). The LFHS and HFLS liquids were isocaloric, and all rewards were matched in flavor (blackcurrant or peach) and other ingredients (e.g., protein, salt; Table 1). C, Reward-probability schedule. The probabilities of receiving large rewards (reward probability) were assigned in two block types. In block A, LFHS and HFLS were associated with a high reward probability (P=0.8); LFLS and HFHS were associated with a low reward probability (P=0.2). All reward probabilities were reversed in block B. Each session started with either block A or block B, and the reward probabilities changed between the two block types every 100 trials, with typically three to five alternations per session. D, Choices and reward outcomes in single sessions for monkey Ya (left) and monkey Ym (right). Tick marks represent choices for specific rewards; long marks indicate large-reward outcomes (rewarded trials), and short marks indicate small-reward outcomes (nonrewarded trials). Reward types in dark-gray blocks were associated with high reward probability (P=0.8), and those in light-gray blocks were associated with low reward probability (P=0.2). Choice-probability curves show nine-trial running averages of choices for each reward (N, number of trials).
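The reward-probability schedule in C can be simulated in a few lines. The sketch below is our reconstruction from the caption; the function name session_schedule and the stand-in choice rule are hypothetical.

```python
import numpy as np

# Sketch of the Figure 1C reward-probability schedule (reconstruction from the
# caption; session_schedule is a hypothetical name of ours).
BLOCK_A = {"LFHS": 0.8, "HFLS": 0.8, "LFLS": 0.2, "HFHS": 0.2}
BLOCK_B = {k: 1.0 - p for k, p in BLOCK_A.items()}  # block B reverses block A

def session_schedule(n_blocks=4, block_len=100, start_with_a=True):
    """Yield the per-trial reward-probability mapping for one session."""
    blocks = [BLOCK_A, BLOCK_B] if start_with_a else [BLOCK_B, BLOCK_A]
    for b in range(n_blocks):           # typically three to five alternations
        for _ in range(block_len):      # probabilities reverse every 100 trials
            yield blocks[b % 2]

rng = np.random.default_rng(1)
for probs in session_schedule():
    cue_a, cue_b = rng.choice(list(probs), size=2, replace=False)  # two of four cues
    chosen = cue_a if rng.random() < 0.5 else cue_b  # stand-in for the touch choice
    # Large reward (0.5 ml) with probability probs[chosen], else small (0.3 ml).
    reward_ml = 0.5 if rng.random() < probs[chosen] else 0.3
```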
Figure 2.
Nutrient-specific learning and choice patterns across sessions. A, Learning curves. Mean choice frequencies (nine-trial running averages ± SEM) aligned to reward-probability reversals (dashed line, P=0.2→0.8) indicate how choices responded to changes in reward probabilities depending on reward nutrient content. N, Number of low-to-high reward-probability reversals. Two-sample t test on the choice probability averaged across the 70th to 80th trials after reversals. B, Learning latency, defined as the trial interval between a probability reversal and the first significant change of choice patterns (see Materials and Methods). We included only the first probability reversal across sessions to avoid influences of different prereversal choice probabilities on learning latency. Median ± 95% confidence interval; p values, pairwise two-sided Wilcoxon rank sum test. C, Reward preference. Averaged choice frequencies (mean ± SEM) indicate preferences for the four reward types. The choice frequencies were aggregated across sessions, which were truncated to have balanced block types (see Materials and Methods); p values, pairwise two-sample binomial test. N, trial numbers. D, Initial learning from novel visual cues. Choices and reward outcomes in the initial 30 trials of two example sessions for monkey Ya (left, block type A) and monkey Ym (right, block type B) show how the monkeys differentially associated novel visual cues with the reward types while learning from reward outcomes. Tick marks represent choices for specific rewards; long marks indicate large-reward outcomes (rewarded trials), and short marks indicate small-reward outcomes (nonrewarded trials). Reward types in dark-gray blocks were associated with high reward probability (P=0.8), and those in light-gray blocks were associated with low reward probability (P=0.2). Choice-probability curves show nine-trial running averages of choices for each reward. E, History-dependent choice probabilities. The probability of choosing each reward (mean ± SEM) increased after choosing and receiving that specific reward. This influence of reward history depended on the fat and sugar content of the reward; p values, pairwise two-sample binomial test. The conditional probability of choosing each reward was computed based on the outcome when the reward was last offered. Left, The reward was not chosen (Nonchosen). Trial numbers LFLS, HFLS, LFHS, HFHS = 1775, 1909, 622, 342 (Ya); 1677, 1507, 1117, 1042 (Ym). Middle, The reward was chosen, but only a small reward was delivered (Nonrewarded). Trial numbers LFLS, HFLS, LFHS, HFHS = 472, 425, 1707, 1996 (Ya); 967, 1171, 1545, 1616 (Ym). Right, The reward was chosen and a large reward was delivered (Rewarded). Trial numbers LFLS, HFLS, LFHS, HFHS = 231, 206, 971, 1099 (Ya); 563, 704, 859, 953 (Ym).
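The smoothing and conditioning used in this figure can be expressed compactly. The sketch below is our reconstruction of the nine-trial running average (A, D) and the history-conditioned choice probability (E); function names and data layouts are hypothetical.

```python
import numpy as np

def running_choice_probability(choices, reward_type, window=9):
    """Nine-trial running average of choosing one reward type (Fig. 2A, D)."""
    indicator = np.array([c == reward_type for c in choices], dtype=float)
    return np.convolve(indicator, np.ones(window) / window, mode="same")

def history_dependent_choice(choices, offered, rewarded, reward_type):
    """P(choose reward_type | outcome when it was last offered), as in Fig. 2E.

    offered[t] holds the two reward types presented on trial t; rewarded[t] is
    True when the large reward was delivered.
    """
    counts = {"Nonchosen": [0, 0], "Nonrewarded": [0, 0], "Rewarded": [0, 0]}
    last = None  # outcome category on the most recent offer of reward_type
    for t in range(len(choices)):
        if reward_type not in offered[t]:
            continue
        if last is not None:
            counts[last][0] += int(choices[t] == reward_type)
            counts[last][1] += 1
        if choices[t] != reward_type:
            last = "Nonchosen"
        else:
            last = "Rewarded" if rewarded[t] else "Nonrewarded"
    return {k: n / d if d else np.nan for k, (n, d) in counts.items()}
```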
Figure 3.
Nutrient-dependent influences of reward and choice history on choice. A, B, General reward- and choice-history effects in logistic regression. A, Reward-history effects. Regression coefficients (± SEM) for reward history, modeled across all stimuli, reveal baseline effects of recent large versus small rewards in the last 10 trials on current-trial choice. Filled bars, p < 0.05; gray bars, not significant. B, Choice-history effects. Regression coefficients for choice history, obtained from the same logistic regression model as in A, reveal the baseline effect of recent choices on current-trial choice regardless of reward outcome. C–G, Nutrient-dependent logistic regression model. C, Schematic of the reward- and choice-history logistic regression model. Prior trials were indexed only when the same type of reward was offered; current-trial choices were explained by recently rewarded and chosen rewards. D, Nutrient-specific history regressors. In the nutrient-specific logistic regression model (Nutrient model), each reward outcome and choice on trial t was decomposed into effects of the baseline low-nutrient ingredients (Bt, gray), additional fat content (Ft, green), and additional sugar content (St, blue). E, Model comparison across history trial lengths. Performance of the Nutrient model (blue) and the History model (gray), matched in history trial lengths up to the past 10 trials. Models were compared based on ΔAIC = AIC(history trial length = 0) − AIC(history trial length = i), i = 1, 2, …, 10, and confirmed by the log-likelihood ratio test (p value). F, G, Nutrient-specific reward- and choice-history effects. F, Effects of nutrient-specific reward history. Regression coefficients for recent low-nutrient baseline rewards (yellow), fat-containing rewards (green), and sugar-containing rewards (blue) on current-trial choice. Nutrient-specific effects were estimated in the same model as low-nutrient baseline effects; thus, effects of fat and sugar reward history were not accounted for by general effects of reward history. G, Effects of nutrient-specific choice history. Regression coefficients for recent low-nutrient baseline choices (yellow), choices of fat-containing rewards (green), and choices of sugar-containing rewards (blue) on current-trial choice. Nutrient-specific effects were estimated in the same model as low-nutrient baseline effects; thus, effects of fat and sugar choice history were not accounted for by general effects of choice history.
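As a sketch of the regressor construction in D, the code below builds lagged reward-history terms decomposed into baseline (B), fat (F), and sugar (S) components and compares history lengths by AIC, as in E. It is a reconstruction under assumed codings (0/1 nutrient levels, a large-reward indicator), not the authors' implementation.

```python
import numpy as np
import statsmodels.api as sm

def nutrient_history_design(rewarded_large, fat, sugar, n_lags=10):
    """Lagged reward-history regressors split into baseline (B), fat (F), and
    sugar (S) components, as in Figure 3D (assumed 0/1 codings)."""
    r = np.asarray(rewarded_large, dtype=float)
    cols = []
    for lag in range(1, n_lags + 1):
        for comp in (r, r * fat, r * sugar):  # B, F, S terms at this lag
            shifted = np.roll(comp, lag)
            shifted[:lag] = 0.0               # no history before the session start
            cols.append(shifted)
    return np.column_stack(cols)

# Toy demo with synthetic trials; nutrient levels vary across trials, so the
# B/F/S columns are not collinear.
rng = np.random.default_rng(2)
n = 1000
fat, sugar = rng.integers(0, 2, n), rng.integers(0, 2, n)
rewarded = (rng.random(n) < 0.5).astype(float)
y = (rng.random(n) < 0.5).astype(float)  # stand-in current-trial choices
X = sm.add_constant(nutrient_history_design(rewarded, fat, sugar, n_lags=10))
fit = sm.Logit(y, X).fit(disp=0)
print(f"AIC = {fit.aic:.1f}")  # compare across history lengths as in Fig. 3E
```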
Figure 4.
Nutrient-sensitive reinforcement learning. A, Nutrient-sensitive RL model (NutVal-Forget model; see above, Materials and Methods). Expected values for each option (Qi) were iteratively updated based on subjective reward values, Vi, constructed from animal-specific values for high-fat content (VF), high-sugar content (VS), and the fat–sugar interaction (VFS), relative to the low-nutrient baseline reward (V = 1). IF(t) and IS(t) indicate the fat and sugar levels of the reward, respectively. Qi refers to the value of reward i, which was updated depending on both the reward outcome on trial t, R(t), and the reward type received on trial t, i(t). All reward values were normalized to the highest reward value (VF × VS × VFS) to be constrained between 0 and 1. Values of unrewarded and unoffered options decayed by a factor of (1 − δ); δ, forgetting rate ∈ [0, 1]. B, Subjective nutrient values. Fitted parameters indicating subjective values for fat, sugar, and the fat–sugar interaction in each session, fitted by the NutVal-Forget model. Data were log-transformed (base 2) for visualization; p value, Wilcoxon signed-rank test. Gray lines indicate reference values as null hypotheses for each parameter. C, Nutrient-specific learning rates (compared with the low-nutrient baseline αL) and the value-forgetting rate δ (compared with δ = 0, perfect value memory), fitted in the NutValAlpha-Forget model; p value, Wilcoxon signed-rank test. Gray lines indicate reference values as null hypotheses for each parameter. D, Temporal dynamics of nutrient values. Nutrient values plotted across chronological sessions for both monkeys. E, Nutrient-based learning behavior of the NutVal-Forget model. Simulated choices based on the NutVal-Forget model reproduced the nutrient-specific learning observed in Figure 2E. The probability of choosing each reward (mean ± SEM) increased with previous reward outcomes, but to different extents depending on reward fat and sugar content; p values, pairwise two-sample binomial test. F, Model comparison across RL models, including the nine combinatorial RL models, the NutRPE model (Fig. 7), and the Energy model (Fig. 4G). The model comparison was conducted based on the AIC across testing sessions (mean ± SEM) for monkey Ya (blue) and monkey Ym (orange). The gray line indicates the statistical decision threshold (relative likelihood of a given model < 0.05; see above, Materials and Methods) compared with the best-fitting NutVal-Forget model (red arrowhead). Comparisons between any two of the other models can also be performed by taking their AIC differences. G, Psychometric curves relating model-derived reward values to choice probability. Model-fitted reward values from the NutVal-Forget model (black) outperformed the Energy model (Energy RL, red) and performed equally well as the ObjVal model in explaining the choices of the monkeys (see above, Materials and Methods). Inset, ΔAIC = AIC(Energy) − AIC(Nutrient). Pcorr, percentage of correctly modeled choices ± SEM; pR2, pseudo-R2 ± SEM. *p < 0.05, **p < 0.01, ***p < 0.001; n.s., not significant.
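The caption specifies the value construction, normalization, and forgetting of the NutVal-Forget model; the sketch below restates them in code. It is a reconstruction from the caption alone (in particular, the assumption that only large rewards drive positive updates is our reading), not the published implementation.

```python
def subjective_value(i_f, i_s, v_f, v_s, v_fs):
    """Vi from VF, VS, VFS relative to the baseline (V = 1), normalized by the
    highest value (VF x VS x VFS) to lie in [0, 1] (assumes VF, VS, VFS >= 1)."""
    raw = (v_f ** i_f) * (v_s ** i_s) * (v_fs ** (i_f * i_s))
    return raw / (v_f * v_s * v_fs)

def nutval_forget_step(Q, chosen, large_reward, levels, p):
    """One trial of NutVal-Forget value updating (reconstruction).

    Q: dict option -> value; levels: option -> (IF, IS); p: dict with learning
    rate 'alpha', forgetting rate 'delta', and nutrient values 'v_f', 'v_s', 'v_fs'.
    """
    for opt in Q:
        if opt == chosen and large_reward:
            i_f, i_s = levels[opt]
            v = subjective_value(i_f, i_s, p["v_f"], p["v_s"], p["v_fs"])
            Q[opt] += p["alpha"] * (v - Q[opt])  # reward-driven update
        else:
            Q[opt] *= 1.0 - p["delta"]           # unrewarded/unoffered values decay
    return Q
```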
Figure 5.
Model recovery analysis. A, B, We evaluated the sensitivity and reliability of our model comparison based on (A) the proportion of correctly recovered models from a specified simulated model (confusion matrix) and (B) the confidence of best-model predictions using our approach (inversion matrix; see above, Materials and Methods). Colors indicate the conditional probabilities stated above each matrix. Left, The nine combinatorial RL models. Red boxes in the matrices indicate data for our best-fitting NutVal-Forget model. The model comparison correctly identified 82% of the simulated sessions from the NutVal-Forget model (compare the conditional probabilities in the sixth row of A). Additionally, the NutVal-Forget model was the most probable generative model when predicted as the best-fitting model (compare the posterior probabilities in the sixth column of B).
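The confusion and inversion matrices summarize a standard model-recovery loop. The sketch below shows that logic under the assumption that the best-fitting model is selected by minimum AIC; the simulate and fit_aic callables are hypothetical placeholders.

```python
import numpy as np

def model_recovery(models, simulate, fit_aic, n_sessions=100):
    """Confusion and inversion matrices as in Figure 5 (reconstruction).

    models: list of model names; simulate(name) -> one synthetic session;
    fit_aic(name, data) -> AIC of that model fitted to the data.
    """
    k = len(models)
    counts = np.zeros((k, k))  # rows: simulated model; columns: best-fitting model
    for i, generative in enumerate(models):
        for _ in range(n_sessions):
            data = simulate(generative)
            aics = [fit_aic(m, data) for m in models]
            counts[i, int(np.argmin(aics))] += 1
    confusion = counts / counts.sum(axis=1, keepdims=True)  # P(best | simulated)
    # Inversion by Bayes' rule, assuming a uniform prior over generative models.
    inversion = counts / counts.sum(axis=0, keepdims=True)  # P(simulated | best)
    return confusion, inversion
```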
Figure 6.
Parameter recovery analysis. A–H, Correlations between simulated and fitted parameters for each of the eight free parameters in the NutVal-Forget model, including (A) learning rate, α; (B) inverse temperature, β; (C) side bias, β0; (D) value-forgetting rate, δ; (E) discount factor, d; (F) fat value, VF; (G) sugar value, VS; and (H) fat–sugar interaction, VFS. The simulation was performed based on session-specific fitted parameters from each monkey to approximate the valid range of parameters (see above, Materials and Methods). The value parameters in F–H were log transformed (base 2). Dashed line, Unity line; red line, least-squares regression line. Inset, Slope, the fitted slope of the regression line; the p value was estimated based on the least-squares linear fit of the data points. I, Parameter trade-off. Cross-correlation between the eight fitted parameters in A–H identified mutual dependence between free parameters in the NutVal-Forget model.
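Parameter recovery reduces to regressing fitted against simulated parameter values. The sketch below shows that step and the cross-correlation used for the trade-off analysis in I; function names and data layouts are hypothetical.

```python
import numpy as np
from scipy import stats

def recovery_fit(simulated, fitted):
    """Least-squares regression of fitted on simulated parameter values; a
    slope near 1 with a small p value indicates good recovery (Fig. 6A-H)."""
    res = stats.linregress(simulated, fitted)
    return res.slope, res.intercept, res.pvalue

def parameter_tradeoff(fitted_matrix):
    """Cross-correlation of fitted parameters across sessions (Fig. 6I);
    fitted_matrix has shape (n_sessions, n_parameters)."""
    return np.corrcoef(fitted_matrix, rowvar=False)
```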
Figure 7.
Learning with nutrient-specific reward prediction errors. A, NutRPE model. Subjective values for fat (top, green) and sugar (bottom, blue) were updated based on differences between reward outcomes and expected nutrient values (nutrient prediction errors, NPEs); fat values and sugar values were multiplied into integrated reward values to guide choices. B, Single-session data for choices, rewards, and modeled choice probabilities based on the NutRPE model (monkey Ym). Top, Model-predicted choices (faint lines) tracked the actual choices of monkey Ym (thick lines) for nutrient-defined rewards. Bottom, Trial-by-trial record of choices and reward outcomes. Tick marks represent choices for each reward; long marks indicate large-reward outcomes (rewarded trials), and short marks indicate small-reward outcomes (nonrewarded trials). C, Psychometric curves based on value estimates from the NutRPE model accounted for the choices of the monkeys with good model-fit indicators. Pcorr, percentage of correctly modeled choices ± SEM; pR2, pseudo-R2 ± SEM. D, Single-session dynamics of nutrient values. Reward values in monkey Ya were dominated by sugar values. In the NutRPE model, these sugar values were updated after choices of high-sugar liquids and tracked blockwise changes in reward probability. In monkey Ym, both fat values and sugar values contributed to value updating and tracked the fluctuating reward probabilities. E, Sugar prediction errors in the single session shown in B. Prediction errors were sensitive to sugar content and reward size, as determined by model-derived nutrient value parameters fitted to the choices of the monkey. Single-session data in B, D, and E are from the same session as in Fig. 1D (right).
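The NPE update in A can be sketched as follows. This reconstruction assumes each nutrient value is updated only when a reward containing that nutrient is chosen, which is our reading of the caption rather than the published equations; the function name nutrpe_step is hypothetical.

```python
def nutrpe_step(v_fat, v_sugar, chosen_levels, outcome, alpha):
    """One NutRPE update (reconstruction from Fig. 7A).

    chosen_levels: (IF, IS) of the chosen reward; outcome: scalar reward
    outcome (e.g., large vs. small amount); alpha: learning rate.
    """
    i_f, i_s = chosen_levels
    if i_f:  # fat value updated only after choosing a fat-containing reward
        v_fat += alpha * (outcome - v_fat)      # fat-specific prediction error
    if i_s:  # sugar value updated only after choosing a sugar-containing reward
        v_sugar += alpha * (outcome - v_sugar)  # sugar-specific prediction error
    integrated = v_fat * v_sugar  # multiplied into an integrated reward value
    return v_fat, v_sugar, integrated
```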
Figure 8.
Hypothesized neuron types encoding nutrient-specific learning and decision variables. A, Fat value neurons signal the trial-by-trial fat-specific value component and update their activity based on fat-specific reward prediction errors. B, A similar process operates for sugar value neurons. C, Inputs from fat and sugar value neurons may converge onto reward value neurons that signal an integrated, scalar value weighted by the subjective nutrient preferences; integrated reward prediction errors would update these signals to guide learning and choice.

