Inferring learning rules from animal decision-making

Zoe C Ashwood et al. Adv Neural Inf Process Syst. 2020;33:3442-3453.

Abstract

How do animals learn? This remains an elusive question in neuroscience. Whereas reinforcement learning often focuses on the design of algorithms that enable artificial agents to efficiently learn new tasks, here we develop a modeling framework to directly infer the empirical learning rules that animals use to acquire new behaviors. Our method efficiently infers the trial-to-trial changes in an animal's policy, and decomposes those changes into a learning component and a noise component. Specifically, this allows us to: (i) compare different learning rules and objective functions that an animal may be using to update its policy; (ii) estimate distinct learning rates for different parameters of an animal's policy; (iii) identify variations in learning across cohorts of animals; and (iv) uncover trial-to-trial changes that are not captured by normative learning rules. After validating our framework on simulated choice data, we applied our model to data from rats and mice learning perceptual decision-making tasks. We found that certain learning rules were far more capable of explaining trial-to-trial changes in an animal's policy. Whereas the average contribution of the conventional REINFORCE learning rule to the policy update for mice learning the International Brain Laboratory's task was just 30%, we found that adding baseline parameters allowed the learning rule to explain 92% of the animals' policy updates under our model. Intriguingly, the best-fitting learning rates and baseline values indicate that an animal's policy update, at each trial, does not occur in the direction that maximizes expected reward. Understanding how an animal transitions from chance-level to high-accuracy performance when learning a new task not only provides neuroscientists with insight into their animals, but also provides concrete examples of biological learning algorithms to the machine learning community.
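
To make the framework concrete, the short sketch below (Python/NumPy; illustrative, not the authors' code) assumes a Bernoulli logistic policy over two choices, parameterized by a bias weight and a stimulus weight, and implements one trial-to-trial update as the sum of a REINFORCE-style learning component and a Gaussian noise component. All variable names and numerical values are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def choice_prob_right(w, x):
        """Probability of a rightward choice under a Bernoulli logistic policy."""
        return 1.0 / (1.0 + np.exp(-(w @ x)))

    def trial_update(w, x, choice, reward, alpha, baseline, sigma, rng):
        """One trial-to-trial weight change: learning component plus noise component."""
        grad_log_pi = (choice - choice_prob_right(w, x)) * x   # REINFORCE gradient
        learning = alpha * (reward - baseline) * grad_log_pi   # learning component
        noise = rng.normal(0.0, sigma, size=w.shape)           # noise component
        return w + learning + noise, learning, noise

    # Example: one trial with a bias weight and a stimulus weight (illustrative values).
    w = np.array([0.2, 1.0])             # [w_bias, w_stimulus]
    x = np.array([1.0, 0.5])             # [constant bias input, signed stimulus]
    choice, reward = 1, 1.0              # rightward choice on a rewarded trial
    w_new, learning, noise = trial_update(w, x, choice, reward,
                                          alpha=np.array([0.05, 0.05]),
                                          baseline=0.0, sigma=0.02, rng=rng)

Decomposing each observed step into the learning and noise terms returned above is what allows the method to ask how much of the animal's behavioral change a given learning rule can account for.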


Figures

Figure 1:
Model schematic. (a) We use a state-space representation with a set of time-varying weights w_t, whose change is driven by a learning process as well as noise. (b) Animals usually improve their task performance with continued training, such that their expected reward gradually increases; however, the trial-to-trial change of behavior is not always in the reward-maximizing direction. (c) Considering the animal’s learning trajectory in weight space, we model each step Δw_t as a sum of a learning component (ascending the expected reward landscape) and a random noise component.
Figure 2:
Validation on simulated data. (a) The IBL task [11]: on each trial, a sinusoidal grating (with contrast values between 0 and 100%) appears on either the left or right side of a screen. Mice must report the side of the grating by turning a wheel (left or right) in order to receive a water reward. (b) We simulate a bias weight and stimulus weight (solid lines) which evolve according to our model using the REINFORCE rule, then generate choice data. From the choice data, we successfully recover the weights (dashed lines) with a 95% credible interval (shading). (c) We also successfully recover the underlying hyperparameters from the simulated data (error bars are ±1 posterior SD). (d) We decompose each recovered weight into a learning component (solid lines) and a noise component (dashed lines). Shading shows the cumulative error between the true and recovered components.
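
For intuition about how simulated data like that in panel (b) could be generated, here is a rough sketch of an IBL-style simulation under a REINFORCE rule plus noise. The contrast set, learning rate, and noise scale are placeholder values, not the settings used for the figure.

    import numpy as np

    rng = np.random.default_rng(1)
    n_trials = 5000
    alpha, sigma = 0.05, 0.02              # assumed learning rate and noise scale
    w = np.array([0.5, 0.0])               # [bias weight, stimulus weight]
    weights, choices = [], []

    for t in range(n_trials):
        contrast = rng.choice([-1.0, -0.5, -0.25, 0.25, 0.5, 1.0])   # signed contrast
        x = np.array([1.0, contrast])                                # [bias input, stimulus]
        p_right = 1.0 / (1.0 + np.exp(-(w @ x)))
        choice = rng.random() < p_right                              # 1 = right, 0 = left
        reward = float(choice == (contrast > 0))                     # rewarded if correct
        grad = (choice - p_right) * x                                # REINFORCE gradient
        w = w + alpha * reward * grad + rng.normal(0.0, sigma, 2)    # learning + noise
        weights.append(w.copy())
        choices.append(int(choice))

The recorded choices and stimuli then play the role of the observed data from which the weight trajectories and hyperparameters are re-inferred.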
Figure 3:
Results from an example IBL mouse. (a-d) Inferred trial-to-trial weight trajectories for the choice bias (yellow) and contrast sensitivity (purple), recovered under different learning models: (a) RF0, the no-learning model, which uses only a noise component to track the changes in behavior. This mouse’s bias fluctuates between leftward and rightward choices (negative and positive bias weight), whereas its decision-making is increasingly influenced by the task stimuli (gradually increasing stimulus weight). (b) RF1, REINFORCE with a single learning rate for all weights. (c) RFK, REINFORCE with a separate learning rate for each of the two weights. (d) RFβ, REINFORCE with baselines, where the baseline is also inferred separately for each weight. (e-g) The decomposition of trial-to-trial weight updates into learning and noise components, for the model shown in the same row; the learning component is shown with a solid line and the noise component with a dashed line.
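
The models in panels (a-d) differ only in the learning component added to the weights on each trial; all share the same noise component. A compact sketch of those alternatives follows (names and argument conventions are illustrative, not the authors' code):

    import numpy as np

    def learning_term(model, grad_log_pi, reward, alphas=None, betas=None):
        """Learning component of the weight update for each model variant."""
        if model == "RF0":                                 # no learning, noise only
            return np.zeros_like(grad_log_pi)
        if model == "RF1":                                 # one scalar learning rate
            return alphas[0] * reward * grad_log_pi
        if model == "RFK":                                 # one learning rate per weight
            return alphas * reward * grad_log_pi
        if model == "RFbeta":                              # per-weight rates and baselines
            return alphas * (reward - betas) * grad_log_pi
        raise ValueError(model)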
Figure 4:
Population analysis for 13 IBL mice. (a) The average fraction of the trial-to-trial weight updates along the learning direction, as prescribed by three learning models: RF1, RFK, and RFβ. Each open circle represents a mouse; the example mouse from Fig. 3 is marked by a filled circle. The solid bars indicate the mean fraction across the animal cohort. Whereas the mean fraction of animals’ weight updates due to learning is just 0.30 for the RF1 model, it is 0.92 for the RFβ model. (b) The inferred learning rates and baselines, for the contrast and bias weights, from each mouse using the RFβ model. (c) Model comparison across learning rules within the RF family and beyond it (see Sec. 3.5 for a description of the AAR and RAR learning rules), in terms of the difference in their Akaike Information Criterion (AIC) relative to the REINFORCE model (RFK). Each line is a mouse, and our example mouse is marked in black. (d) Model comparison within the family of REINFORCE models with different numbers of varied learning rates. One outlier mouse was excluded from this figure for visibility (its AIC decreased by 126.5 for the RFK model relative to the RF0 model). Our example mouse is marked in black.
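
Panels (c) and (d) report differences in AIC relative to a reference model (RFK in panel (c)). A minimal sketch of that comparison is below; the log-likelihoods and parameter counts are placeholders, not fitted values from the paper.

    # Lower AIC indicates a better trade-off between fit and model complexity.
    def aic(log_likelihood, n_params):
        return 2 * n_params - 2 * log_likelihood

    models = {                      # hypothetical fitted values for one mouse
        "RF0":    {"loglik": -3200.0, "k": 3},
        "RF1":    {"loglik": -3150.0, "k": 4},
        "RFK":    {"loglik": -3140.0, "k": 5},
        "RFbeta": {"loglik": -3100.0, "k": 7},
    }
    aics = {name: aic(m["loglik"], m["k"]) for name, m in models.items()}
    delta_aic = {name: a - aics["RFK"] for name, a in aics.items()}   # relative to RFK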
Figure 5:
Weight trajectories plotted on the expected reward landscape for the IBL task. Increasing w_contrast while driving w_bias toward zero yields a higher expected reward. (a) The recovered full trajectory for an example IBL mouse over the course of 6000 trials for the RFK model (the same trajectory shown in Figure 3c). We compare the animal’s trajectory with deterministic trajectories generated (without noise) from the (b) RF1 and (c) RFK learning rules when the learning rates are fixed to those inferred from data.
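
One plausible way to approximate such a landscape is to average, on a grid of weight values, the probability of a rewarded choice over the task's stimulus distribution under the logistic policy; the contrast set below is illustrative.

    import numpy as np

    contrasts = np.array([-1.0, -0.5, -0.25, 0.25, 0.5, 1.0])   # signed contrasts
    w_bias = np.linspace(-3, 3, 121)
    w_contrast = np.linspace(-1, 6, 141)
    WB, WC = np.meshgrid(w_bias, w_contrast)

    expected_reward = np.zeros_like(WB)
    for c in contrasts:
        p_right = 1.0 / (1.0 + np.exp(-(WB + WC * c)))           # logistic policy
        p_correct = p_right if c > 0 else 1.0 - p_right          # rewarded choice
        expected_reward += p_correct / len(contrasts)
    # expected_reward can then be drawn as a heat map with the inferred weight
    # trajectory overlaid on it, as in this figure.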
Figure 6:
Results from a rat auditory discrimination task [2]. (a) We track an example rat’s choice bias (yellow) and its sensitivity to two stimuli (red, blue) while it trains on the task described in (b). (b) In this task, a rat hears two tones of different amplitudes (tones A and B) separated by a delay. If tone A is quieter than tone B, the rat must nose-poke into the left port for reward, and vice versa if tone A is louder than B. (c) We now use the RFβ model to predict how the rat updates its behavior. (d) The weights from (c) are decomposed into learning (solid) and noise (dashed) components, as in Fig. 3g.

References

    1. Ahmadian Y, Pillow JW, and Paninski L. Efficient Markov chain Monte Carlo methods for decoding neural spike trains. Neural Computation, 23(1):46–96, 2011. ISSN 0899–7667. doi: 10.1162/NECO_a_00059.
    2. Akrami A, Kopec CD, Diamond ME, and Brody CD. Posterior parietal cortex represents sensory history and mediates its effects on behaviour. Nature, 554(7692):368, 2018. Data available at: 10.6084/m9.figshare.12213671.v1.
    3. Ashwood ZC, Roy NA, Stone IR, Laboratory TIB, Churchland AK, Pouget A, and Pillow JW. Mice alternate between discrete strategies during perceptual decision-making. bioRxiv, 2020.10.19.346353, Oct. 2020. doi: 10.1101/2020.10.19.346353. URL https://www.biorxiv.org/content/10.1101/2020.10.19.346353v1.
    4. Bak JH, Choi JY, Akrami A, Witten I, and Pillow JW. Adaptive optimal training of animal behavior. In Lee DD, Sugiyama M, Luxburg UV, Guyon I, and Garnett R, editors, Advances in Neural Information Processing Systems 29, pages 1947–1955, 2016.
    5. Bishop CM. Pattern Recognition and Machine Learning. Springer, 2006.
