PLoS Comput Biol. 2019 Jun 11;15(6):e1006903. doi: 10.1371/journal.pcbi.1006903. eCollection 2019 Jun.

Models that learn how humans learn: The case of decision-making and its disorders


Amir Dezfouli et al.

Abstract

Popular computational models of decision-making make specific assumptions about learning processes that may cause them to underfit observed behaviours. Here we suggest an alternative method using recurrent neural networks (RNNs) to generate a flexible family of models that have sufficient capacity to represent the complex learning and decision-making strategies used by humans. In this approach, an RNN is trained to predict the next action that a subject will take in a decision-making task and, in this way, learns to imitate the processes underlying subjects' choices and their learning abilities. We demonstrate the benefits of this approach using a new dataset drawn from patients with either unipolar (n = 34) or bipolar (n = 33) depression and matched healthy controls (n = 34) making decisions on a two-armed bandit task. The results indicate that this new approach is better than baseline reinforcement-learning methods in terms of overall performance and its capacity to predict subjects' choices. We show that the model can be interpreted using off-policy simulations and thereby provides a novel clustering of subjects' learning processes, something that often eludes traditional approaches to modelling and behavioural analysis.
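For context on the comparison, the reinforcement-learning baselines mentioned here are typically delta-rule value-learning models with a softmax choice rule. The sketch below is a generic Q-learner of that kind, with free parameters alpha (learning rate) and beta (inverse temperature); it is illustrative only, and the exact baseline models fitted in the paper may differ.

```python
import numpy as np

def q_learning_nll(actions, rewards, alpha, beta):
    """Negative log-likelihood of a choice sequence under a delta-rule Q-learner
    with a softmax policy (a generic RL baseline; illustrative, not the paper's code)."""
    q = np.zeros(2)          # action values for the two bandit arms
    nll = 0.0
    for a, r in zip(actions, rewards):
        p = np.exp(beta * q) / np.exp(beta * q).sum()  # softmax choice probabilities
        nll -= np.log(p[a])                            # likelihood of the observed choice
        q[a] += alpha * (r - q[a])                     # delta-rule value update
    return nll
```

Fitting such a baseline amounts to minimising this negative log-likelihood over (alpha, beta) for each subject, for example with a standard numerical optimiser.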


Conflict of interest statement

Part of this work was conducted while PD was visiting Uber Technologies. The latter played no role in its design, execution or communication.

Figures

Fig 1
Fig 1. Structure of the RNN model.
The model has an LSTM layer (shown by the red dashed line) which receives the previous action and reward as inputs and is connected to a softmax layer (shown by a black rectangle) which outputs the probability of selecting each action on the next trial (the policy). The LSTM layer is composed of a set of LSTM cells (N_c cells, shown by blue circles) that are connected to each other (green arrows). The outputs of the cells (denoted by h_t^i for cell i at time t) are connected to the softmax layer by a set of connections shown by black lines. The free parameters of the model (in both the LSTM and softmax layers) are denoted by Θ, and L(Θ, rnn) is a metric that represents how well the model fits the subjects' data; it is used to adjust the parameters of the model by maximum-likelihood estimation as the network learns how humans learn.
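As a rough illustration of this architecture (not the authors' implementation), the sketch below builds an LSTM layer whose input on each trial is the previous action (one-hot) and reward, followed by a linear-plus-softmax readout over the two actions; Θ is then fit by maximising the likelihood of the subjects' observed choices. The framework, layer size and all names here are assumptions.

```python
import torch
import torch.nn as nn

class ChoiceRNN(nn.Module):
    """LSTM over (previous action, previous reward) -> softmax policy over two actions.
    The cell count N_c, the framework and the names are illustrative assumptions."""
    def __init__(self, n_cells=10, n_actions=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_actions + 1, hidden_size=n_cells, batch_first=True)
        self.readout = nn.Linear(n_cells, n_actions)   # the softmax layer (black rectangle in Fig 1)

    def forward(self, prev_actions_onehot, prev_rewards):
        # prev_actions_onehot: (batch, trials, n_actions); prev_rewards: (batch, trials)
        x = torch.cat([prev_actions_onehot, prev_rewards.unsqueeze(-1)], dim=-1)
        h, _ = self.lstm(x)            # h[:, t, i] plays the role of h_t^i in the caption
        return self.readout(h)         # logits; a softmax over the last dimension gives the policy

# Maximum-likelihood estimation of Θ amounts to minimising the cross-entropy between the
# predicted policy and the observed choices, e.g.:
# logits = model(a_prev, r_prev)
# loss = nn.functional.cross_entropy(logits.flatten(0, 1), observed_actions.flatten())
```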
Fig 2
Fig 2. Structure of the decision-making task.
Subjects had a choice between a left keypress (L) and a right keypress (R), shown by yellow rectangles. Before the choice, no indication was given as to which button was more likely to lead to reward. When the participant made a rewarded choice, the chosen button was highlighted (green) and a picture of the earned reward was presented for 500 ms (an M&M chocolate in this case). The task was divided into 12 blocks, each lasting 40 seconds and separated by a 12-second inter-block interval. Within each block, actions were self-paced and participants were free to complete as many trials as they could within the 40-second time limit. The probability of earning a reward from each action was varied between the blocks. See the text for more details about the probabilities of earning rewards from actions.
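A minimal simulation of this block structure is sketched below, assuming independent Bernoulli reward probabilities per arm that are reset at each block boundary. The specific probabilities and the fixed trial count are placeholders, since trials were self-paced in the real task.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_block(agent_choice, reward_probs, n_trials=30):
    """One block of the two-armed bandit: each action pays off with a fixed
    Bernoulli probability. n_trials stands in for the self-paced 40 s block."""
    history = []
    for _ in range(n_trials):
        a = agent_choice(history)                   # 0 = left (L), 1 = right (R)
        r = int(rng.random() < reward_probs[a])     # Bernoulli reward
        history.append((a, r))
    return history

# 12 blocks, each with its own (left, right) reward probabilities (values illustrative)
blocks = [run_block(lambda h: rng.integers(2), rng.uniform(0.05, 0.5, size=2))
          for _ in range(12)]
```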
Fig 3
Fig 3. Probability of selecting the action with the higher reward probability (averaged over subjects).
subj refers to the data of the experimental subjects, whereas the remaining columns show simulations of the models trained on the task (on-policy simulations) with the same reward probabilities and for the same number of trials that each subject completed. Each dot represents a subject and error bars represent 1 SEM.
Fig 4
Fig 4. Probability of staying on the same action based on whether the previous trial was rewarded (reward) or not rewarded (no reward), averaged over subjects.
subj shows the data from the subjects, and the remaining columns are derived from on-policy simulations of the various models on the task. Each dot represents a subject and error bars represent 1 SEM.
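The statistic in Fig 4 is a simple conditional stay probability; a sketch of its computation from one subject's sequence of actions and rewards is below (names and coding of actions/rewards as 0/1 are illustrative).

```python
import numpy as np

def stay_probabilities(actions, rewards):
    """P(stay | previous trial rewarded) and P(stay | previous trial not rewarded)."""
    actions, rewards = np.asarray(actions), np.asarray(rewards)
    stay = actions[1:] == actions[:-1]      # did the choice repeat on the next trial?
    prev_rewarded = rewards[:-1] == 1
    return stay[prev_rewarded].mean(), stay[~prev_rewarded].mean()
```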
Fig 5
Fig 5. Cross-validation results.
(Left panel) nlp (negative log-probability) averaged across leave-one-out cross-validation folds; lower values are better. (Right panel) Percentage of actions predicted correctly, averaged over cross-validation folds. Error bars represent 1 SEM.
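The nlp metric is the average negative log-probability that a fitted model assigns to the held-out subject's observed choices. A sketch of a leave-one-subject-out loop is below; `fit` and `predict_proba` are assumed interfaces, and the paper's actual fitting procedure may differ. For two actions, the right-panel accuracy corresponds to the fraction of trials on which the probability of the chosen action exceeds 0.5.

```python
import numpy as np

def loo_nlp(subjects, fit, predict_proba):
    """Leave-one-subject-out cross-validation of the nlp metric. `fit` trains a model
    on a list of subjects; `predict_proba` returns, for each trial of the held-out
    subject, the probability the model assigned to the action actually taken."""
    scores = []
    for i in range(len(subjects)):
        model = fit([s for j, s in enumerate(subjects) if j != i])
        p_chosen = np.asarray(predict_proba(model, subjects[i]))
        scores.append(-np.mean(np.log(p_chosen)))    # nlp: lower is better
    return np.mean(scores), np.std(scores) / np.sqrt(len(scores))   # mean and SEM
```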
Fig 6
Fig 6. Off-policy simulations of all models for the healthy group.
Each panel shows a simulation of 30 trials (horizontal axis), and the vertical axis shows the predictions of each model on each trial. The ribbon below each panel shows the action that was fed to the model on each trial. In the first 10 trials the action fed to the model was R, and in the next 20 trials it was L. Rewards are shown by black crosses (x) on the graphs. Red arrows point to the same trial number in all the simulations, so that the predictions on that trial can be compared across simulations. The sequence of rewards and actions fed to the model is the same for the panels in each column, but differs across the columns. See the text for the interpretation of the graph.
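Off-policy simulation here means clamping the model's inputs to a fixed, experimenter-chosen action/reward sequence and reading out the prediction the model makes on every trial. A sketch using the illustrative ChoiceRNN interface from Fig 1 is below; the one-trial offset between inputs and predictions is simplified for brevity.

```python
import torch

def off_policy_predictions(model, actions, rewards):
    """Clamp the model's inputs to a fixed action/reward sequence and record P(L)
    on every trial. `model` is the illustrative ChoiceRNN sketched at Fig 1;
    actions and rewards are lists of 0/1 (0 = L, 1 = R)."""
    a = torch.nn.functional.one_hot(torch.tensor(actions), num_classes=2).float().unsqueeze(0)
    r = torch.tensor(rewards, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        logits = model(a, r)                          # shape (1, trials, 2)
    return torch.softmax(logits, dim=-1)[0, :, 0]     # probability of choosing L on each trial

# e.g. the Fig 6 schedule: 10 trials of R followed by 20 trials of L (rewards illustrative)
# probs = off_policy_predictions(model, actions=[1]*10 + [0]*20, rewards=[0]*30)
```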
Fig 7
Fig 7. Off-policy simulations of the RNN model for all groups.
Each panel shows a simulation of 30 trials (horizontal axis), and the vertical axis shows the predictions for each group on each trial. The ribbon below each panel shows the action that was fed to the model on each trial. In the first 10 trials the action fed to the model was R, and in the next 20 trials it was L. Rewards are shown by black crosses (x) on the graphs, and the red arrows point to the same trial number in all the panels. See the text for the interpretation of the graph. Note that the simulation conditions are the same as those in Fig 6, and the first row here (healthy group) is the same as the first row of Fig 6, repeated for comparison with the other groups.
Fig 8
Fig 8. The effect of the history of previous rewards and actions on the future choices of the subjects.
(Left panel) The probability of staying with an action after earning a reward, as a function of the number of rewards earned since switching to the current action (averaged over subjects). Each red dot represents the data for one subject. (Right panel) The probability of staying with an action as a function of the number of actions taken since switching to the current action. The red line was obtained using LOESS (local regression), a non-parametric regression approach. The grey area around the red line represents the 95% confidence interval. Error bars represent 1 SEM.
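The left-panel statistic conditions the stay probability on how many rewards have been earned since the last switch. A sketch of that tally for one subject is below; the exact bookkeeping used in the paper may differ slightly.

```python
from collections import defaultdict

def stay_by_rewards_since_switch(actions, rewards):
    """P(stay after a rewarded trial) as a function of the number of rewards earned
    since switching to the current action (Fig 8, left panel); illustrative only."""
    counts = defaultdict(lambda: [0, 0])          # k rewards -> [stay count, trial count]
    rewards_since_switch = 0
    for t in range(len(actions)):
        if t > 0 and actions[t] != actions[t - 1]:
            rewards_since_switch = 0              # switching resets the tally
        rewards_since_switch += rewards[t]
        if rewards[t] and t + 1 < len(actions):   # rewarded trial with a following trial
            stays, total = counts[rewards_since_switch]
            counts[rewards_since_switch] = [stays + int(actions[t + 1] == actions[t]), total + 1]
    return {k: stays / total for k, (stays, total) in sorted(counts.items())}
```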
Fig 9
Fig 9. The median number of actions executed sequentially before switching to another action (run of actions) as a function of the length of the previous run of actions (averaged over subjects).
The dotted line shows the points at which the length of the previous and the current run of actions were the same. Note that the median was used instead of the average to illustrate the most common ‘current run length’, rather than the average run length for each subject. The results for the actual data are shown in the subj column, and the remaining columns show the results of the on-policy simulations of the models on the task. Error bars represent 1 SEM.
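This analysis can be summarised as: split the choice sequence into runs of identical actions, relate each run's length to the length of the run that preceded it, and take the median per previous-run length. A sketch (for a single subject; names illustrative) is below.

```python
from collections import defaultdict
from itertools import groupby
from statistics import median

def median_run_length_by_previous(actions):
    """Median current-run length as a function of previous-run length (Fig 9)."""
    runs = [len(list(g)) for _, g in groupby(actions)]   # lengths of consecutive-action runs
    by_prev = defaultdict(list)
    for prev, curr in zip(runs[:-1], runs[1:]):
        by_prev[prev].append(curr)
    return {prev: median(curr) for prev, curr in sorted(by_prev.items())}
```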
Fig 10
Fig 10. Mixed off-policy and on-policy simulations of the models.
Each panel shows a simulation of 20 trials, in which the first nine trials were off-policy and the subsequent trials were on-policy, during which the action with the highest probability was selected. Trials marked with green ribbons were off-policy (actions were fed to the model), whereas trials marked with blue ribbons were on-policy (actions were selected by the model). The ribbon below each panel shows the actions that were fed to the model (for the first nine trials) and the actions that were selected by the model (for the subsequent trials). During the off-policy trials, the sequence of actions fed to the model was R, R, R, R, R, R, L, R, L. See the text for interpretation.
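In this mixed scheme, the model's inputs are clamped to a fixed action sequence for the first trials, after which its own greedy (highest-probability) choices are fed back in. The sketch below steps the illustrative ChoiceRNN from Fig 1 one trial at a time; reward_fn and all names are hypothetical, and the one-trial input/output offset is simplified.

```python
import torch

def mixed_simulation(model, forced_actions, forced_rewards, n_on_policy, reward_fn):
    """Off-policy for the forced trials, then on-policy with greedy choices fed back
    into the model. Uses the illustrative ChoiceRNN; reward_fn(action) stands in
    for the task's reward schedule."""
    actions, rewards, state = list(forced_actions), list(forced_rewards), None
    total = len(forced_actions) + n_on_policy
    with torch.no_grad():
        for t in range(total - 1):
            x = torch.zeros(1, 1, 3)                  # [one-hot action, reward] for trial t
            x[0, 0, actions[t]] = 1.0
            x[0, 0, 2] = rewards[t]
            h, state = model.lstm(x, state)           # carry the LSTM state across trials
            policy = torch.softmax(model.readout(h[0, 0]), dim=-1)
            if t + 1 >= len(forced_actions):          # trial t+1 is on-policy
                actions.append(int(policy.argmax())）  # greedy choice
                rewards.append(reward_fn(actions[-1]))
    return actions[len(forced_actions):]              # the model's own choices

# e.g. nine forced trials R, R, R, R, R, R, L, R, L (coded 1/0), then on-policy trials:
# chosen = mixed_simulation(model, [1, 1, 1, 1, 1, 1, 0, 1, 0], [0]*9,
#                           n_on_policy=11, reward_fn=lambda a: 0)
```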
