Inferring learning rules from animal decision-making

Zoe C Ashwood et al. Adv Neural Inf Process Syst. 2020;33:3442-3453.

Abstract

How do animals learn? This remains an elusive question in neuroscience. Whereas reinforcement learning often focuses on the design of algorithms that enable artificial agents to efficiently learn new tasks, here we develop a modeling framework to directly infer the empirical learning rules that animals use to acquire new behaviors. Our method efficiently infers the trial-to-trial changes in an animal's policy, and decomposes those changes into a learning component and a noise component. Specifically, this allows us to: (i) compare different learning rules and objective functions that an animal may be using to update its policy; (ii) estimate distinct learning rates for different parameters of an animal's policy; (iii) identify variations in learning across cohorts of animals; and (iv) uncover trial-to-trial changes that are not captured by normative learning rules. After validating our framework on simulated choice data, we applied our model to data from rats and mice learning perceptual decision-making tasks. We found that certain learning rules were far more capable of explaining trial-to-trial changes in an animal's policy. Whereas the average contribution of the conventional REINFORCE learning rule to the policy update for mice learning the International Brain Laboratory's task was just 30%, we found that adding baseline parameters allowed the learning rule to explain 92% of the animals' policy updates under our model. Intriguingly, the best-fitting learning rates and baseline values indicate that an animal's policy update, at each trial, does not occur in the direction that maximizes expected reward. Understanding how an animal transitions from chance-level to high-accuracy performance when learning a new task not only provides neuroscientists with insight into their animals, but also provides concrete examples of biological learning algorithms to the machine learning community.
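
To make the framework concrete, the short sketch below (Python/NumPy; illustrative, not the authors' code) assumes a Bernoulli logistic policy over two choices, parameterized by a bias weight and a stimulus weight, and implements one trial-to-trial update as the sum of a REINFORCE-style learning component and a Gaussian noise component. All variable names and numerical values are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def choice_prob_right(w, x):
        """Probability of a rightward choice under a Bernoulli logistic policy."""
        return 1.0 / (1.0 + np.exp(-(w @ x)))

    def trial_update(w, x, choice, reward, alpha, baseline, sigma, rng):
        """One trial-to-trial weight change: learning component plus noise component."""
        grad_log_pi = (choice - choice_prob_right(w, x)) * x   # REINFORCE gradient
        learning = alpha * (reward - baseline) * grad_log_pi   # learning component
        noise = rng.normal(0.0, sigma, size=w.shape)           # noise component
        return w + learning + noise, learning, noise

    # Example: one trial with a bias weight and a stimulus weight (illustrative values).
    w = np.array([0.2, 1.0])             # [w_bias, w_stimulus]
    x = np.array([1.0, 0.5])             # [constant bias input, signed stimulus]
    choice, reward = 1, 1.0              # rightward choice on a rewarded trial
    w_new, learning, noise = trial_update(w, x, choice, reward,
                                          alpha=np.array([0.05, 0.05]),
                                          baseline=0.0, sigma=0.02, rng=rng)

Decomposing each observed step into the learning and noise terms returned above is what allows the method to ask how much of the animal's behavioral change a given learning rule can account for.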


Figures

Figure 1:
Model schematic. (a) We use a state-space representation with a set of time-varying weights w_t, whose change is driven by a learning process as well as noise. (b) Animals usually improve their task performance with continued training, such that their expected reward gradually increases; however, the trial-to-trial change of behavior is not always in the reward-maximizing direction. (c) Considering the animal’s learning trajectory in weight space, we model each step Δw_t as a sum of a learning component (ascending the expected reward landscape) and a random noise component.
Figure 2:
Validation on simulated data. (a) The IBL task [11]: on each trial, a sinusoidal grating (with contrast values between 0 and 100%) appears on either the left or right side of a screen. Mice must report the side of the grating by turning a wheel (left or right) in order to receive a water reward. (b) We simulate a bias weight and stimulus weight (solid lines) which evolve according to our model using the REINFORCE rule, then generate choice data. From the choice data, we successfully recover the weights (dashed lines) with a 95% credible interval (shading). (c) We also successfully recover the underlying hyperparameters from the simulated data (error bars are ±1 posterior SD). (d) We decompose each recovered weight into a learning component (solid lines) and a noise component (dashed lines). Shading shows the cumulative error between the true and recovered components.
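
For intuition about how simulated data like that in panel (b) could be generated, here is a rough sketch of an IBL-style simulation under a REINFORCE rule plus noise. The contrast set, learning rate, and noise scale are placeholder values, not the settings used for the figure.

    import numpy as np

    rng = np.random.default_rng(1)
    n_trials = 5000
    alpha, sigma = 0.05, 0.02              # assumed learning rate and noise scale
    w = np.array([0.5, 0.0])               # [bias weight, stimulus weight]
    weights, choices = [], []

    for t in range(n_trials):
        contrast = rng.choice([-1.0, -0.5, -0.25, 0.25, 0.5, 1.0])   # signed contrast
        x = np.array([1.0, contrast])                                # [bias input, stimulus]
        p_right = 1.0 / (1.0 + np.exp(-(w @ x)))
        choice = rng.random() < p_right                              # 1 = right, 0 = left
        reward = float(choice == (contrast > 0))                     # rewarded if correct
        grad = (choice - p_right) * x                                # REINFORCE gradient
        w = w + alpha * reward * grad + rng.normal(0.0, sigma, 2)    # learning + noise
        weights.append(w.copy())
        choices.append(int(choice))

The recorded choices and stimuli then play the role of the observed data from which the weight trajectories and hyperparameters are re-inferred.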
Figure 3:
Results from an example IBL mouse. (a-d) Inferred trial-to-trial weight trajectories for the choice bias (yellow) and contrast sensitivity (purple), recovered under different learning models: (a) RF0, the no-learning model, which uses only a noise component to track the changes in behavior. This mouse’s bias fluctuates between leftward and rightward choices (negative and positive bias weight), whereas its decision-making is increasingly influenced by the task stimuli (gradually increasing stimulus weight). (b) RF1, REINFORCE with a single learning rate for all weights. (c) RFK, REINFORCE with a separate learning rate for each of the two weights. (d) RFβ, REINFORCE with baselines, where the baseline is also inferred separately for each weight. (e-g) The decomposition of trial-to-trial weight updates into learning and noise components, for the model shown in the same row; the learning component is shown with a solid line and the noise component with a dashed line.
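
The models in panels (a-d) differ only in the learning component added to the weights on each trial; all share the same noise component. A compact sketch of those alternatives follows (names and argument conventions are illustrative, not the authors' code):

    import numpy as np

    def learning_term(model, grad_log_pi, reward, alphas=None, betas=None):
        """Learning component of the weight update for each model variant."""
        if model == "RF0":                                 # no learning, noise only
            return np.zeros_like(grad_log_pi)
        if model == "RF1":                                 # one scalar learning rate
            return alphas[0] * reward * grad_log_pi
        if model == "RFK":                                 # one learning rate per weight
            return alphas * reward * grad_log_pi
        if model == "RFbeta":                              # per-weight rates and baselines
            return alphas * (reward - betas) * grad_log_pi
        raise ValueError(model)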
Figure 4:
Population analysis for 13 IBL mice. (a) The average fraction of the trial-to-trial weight updates along the learning direction, as prescribed by three learning models: RF1, RFK, and RFβ. Each open circle represents a mouse; the example mouse from Fig. 3 is marked by a filled circle. The solid bars indicate the mean fraction across the animal cohort. Whereas the mean fraction of animals’ weight updates due to learning is just 0.30 for the RF1 model, it is 0.92 for the RFβ model. (b) The inferred learning rates and baselines, for the contrast and bias weights, from each mouse using the RFβ model. (c) Model comparison across learning rules within the RF family and beyond it (see Sec. 3.5 for a description of the AAR and RAR learning rules), in terms of the difference in their Akaike Information Criterion (AIC) relative to the REINFORCE model (RFK). Each line is a mouse, and our example mouse is marked in black. (d) Model comparison within the family of REINFORCE models with different numbers of varied learning rates. One outlier mouse was excluded from this figure for visibility (its AIC decreased by 126.5 for the RFK model relative to the RF0 model). Our example mouse is marked in black.
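
Panels (c) and (d) report differences in AIC relative to a reference model (RFK in panel (c)). A minimal sketch of that comparison is below; the log-likelihoods and parameter counts are placeholders, not fitted values from the paper.

    # Lower AIC indicates a better trade-off between fit and model complexity.
    def aic(log_likelihood, n_params):
        return 2 * n_params - 2 * log_likelihood

    models = {                      # hypothetical fitted values for one mouse
        "RF0":    {"loglik": -3200.0, "k": 3},
        "RF1":    {"loglik": -3150.0, "k": 4},
        "RFK":    {"loglik": -3140.0, "k": 5},
        "RFbeta": {"loglik": -3100.0, "k": 7},
    }
    aics = {name: aic(m["loglik"], m["k"]) for name, m in models.items()}
    delta_aic = {name: a - aics["RFK"] for name, a in aics.items()}   # relative to RFK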
Figure 5:
Weight trajectories plotted on the expected reward landscape for the IBL task. Increasing w_contrast while driving w_bias toward zero yields a higher expected reward. (a) The recovered full trajectory for an example IBL mouse over the course of 6000 trials for the RFK model (the same trajectory shown in Figure 3c). We compare the animal’s trajectory with deterministic trajectories generated (without noise) from the (b) RF1 and (c) RFK learning rules when the learning rates are fixed to those inferred from data.
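
One plausible way to approximate such a landscape is to average, on a grid of weight values, the probability of a rewarded choice over the task's stimulus distribution under the logistic policy; the contrast set below is illustrative.

    import numpy as np

    contrasts = np.array([-1.0, -0.5, -0.25, 0.25, 0.5, 1.0])   # signed contrasts
    w_bias = np.linspace(-3, 3, 121)
    w_contrast = np.linspace(-1, 6, 141)
    WB, WC = np.meshgrid(w_bias, w_contrast)

    expected_reward = np.zeros_like(WB)
    for c in contrasts:
        p_right = 1.0 / (1.0 + np.exp(-(WB + WC * c)))           # logistic policy
        p_correct = p_right if c > 0 else 1.0 - p_right          # rewarded choice
        expected_reward += p_correct / len(contrasts)
    # expected_reward can then be drawn as a heat map with the inferred weight
    # trajectory overlaid on it, as in this figure.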
Figure 6:
Results from a rat auditory discrimination task [2]. (a) We track an example rat’s choice bias (yellow) and its sensitivity to two stimuli (red, blue) while it trains on the task described in (b). (b) In this task, a rat hears two tones of different amplitudes (tones A and B) separated by a delay. If tone A is quieter than tone B, the rat must nose-poke into the left port for reward, and vice versa if tone A is louder than B. (c) We now use the RFβ model to predict how the rat updates its behavior. (d) The weights from (c) are decomposed into learning (solid) and noise (dashed) components, as in Fig. 3g.

References

    1. Ahmadian Y, Pillow JW, and Paninski L. Efficient Markov chain Monte Carlo methods for decoding neural spike trains. Neural Computation, 23(1):46–96, 2011. ISSN 0899–7667. doi: 10.1162/NECO_a_00059.
    2. Akrami A, Kopec CD, Diamond ME, and Brody CD. Posterior parietal cortex represents sensory history and mediates its effects on behaviour. Nature, 554(7692):368, 2018. Data available at: 10.6084/m9.figshare.12213671.v1.
    3. Ashwood ZC, Roy NA, Stone IR, Laboratory TIB, Churchland AK, Pouget A, and Pillow JW. Mice alternate between discrete strategies during perceptual decision-making. bioRxiv, 2020.10.19.346353, Oct. 2020. doi: 10.1101/2020.10.19.346353. URL https://www.biorxiv.org/content/10.1101/2020.10.19.346353v1.
    4. Bak JH, Choi JY, Akrami A, Witten I, and Pillow JW. Adaptive optimal training of animal behavior. In Lee DD, Sugiyama M, Luxburg UV, Guyon I, and Garnett R, editors, Advances in Neural Information Processing Systems 29, pages 1947–1955, 2016.
    5. Bishop CM. Pattern Recognition and Machine Learning. Springer, 2006.
