Dopaminergic and Prefrontal Basis of Learning from Sensory Confidence and Reward Value

Armin Lak et al.

Neuron. 2020 Feb 19;105(4):700-711.e6. doi: 10.1016/j.neuron.2019.11.018. Epub 2019 Dec 16.

Abstract

Deciding between stimuli requires combining their learned value with one's sensory confidence. We trained mice in a visual task that probes this combination. Mouse choices reflected not only present confidence and past rewards but also past confidence. Their behavior conformed to a model that combines signal detection with reinforcement learning. In the model, the predicted value of the chosen option is the product of sensory confidence and learned value. We found precise correlates of this variable in the pre-outcome activity of midbrain dopamine neurons and of medial prefrontal cortical neurons. However, only the latter played a causal role: inactivating medial prefrontal cortex before outcome strengthened learning from the outcome. Dopamine neurons played a causal role only after outcome, when they encoded reward prediction errors graded by confidence, influencing subsequent choices. These results reveal neural signals that combine reward value with sensory confidence and guide subsequent learning.
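
To make the abstract's central quantity concrete, here is a minimal sketch (not the authors' code; the variable names, scales, and learning rate are illustrative assumptions) of a predicted value formed as the product of sensory confidence and learned value, and the confidence-graded prediction error that follows from it:

    def predicted_value(confidence, learned_value):
        # Q_chosen = sensory confidence x learned value of the chosen option
        return confidence * learned_value

    def prediction_error(reward, q_chosen):
        # delta = obtained reward - confidence-weighted predicted value
        return reward - q_chosen

    def update_value(learned_value, delta, learning_rate=0.2):
        # Delta-rule update; the learning rate of 0.2 is an illustrative choice.
        return learned_value + learning_rate * delta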

Keywords: Calcium imaging; Decision confidence; Electrophysiology; Mice; Optogenetics; Psychophysics; Reinforcement learning.


Conflict of interest statement

Declaration of Interests: The authors declare no competing interests.

Figures

Figure 1
Behavioral and Computational Signatures of Decisions Guided by Reward Value and Sensory Confidence (A and B) Schematic of the 2-alternative visual task. After the mouse kept the wheel still for at least 0.5 s, a sinusoidal grating stimulus of varying contrast appeared on either the left or right monitor, together with a brief tone (0.1 s, 12 kHz) indicating that the trial had started. The mouse reported the choice by steering the wheel located underneath its forepaws. (C) Rewards for correct choices were higher on the right side (orange) or on the left side (brown), with the more-rewarded side switching in blocks of 50–350 trials. (D) Choices of an example mouse in blocks with large reward on right (orange) or left (brown). Curves in this and subsequent panels are predictions of the behavioral model in (G) and (H), and error bars show SE across trials. See Figures S1B–S1D for similar results from all mice, for learning curves and for reaction times. (E) Choices of the same mouse depend on whether the previous rewarded trials were difficult (low contrast) or easy (high contrast). (F) Average change in the proportion of rightward choices after correct decisions in difficult (black) and easy (gray) choices, averaged across mice. (G and H) Behavioral model of choice (G) and learning (H). (I) Running average of probability of choosing right, in a session containing four blocks (orange versus brown). Black: mouse behavior. Light purple: model predictions. (J) Averaged estimates of QC as a function of absolute contrast (i.e., regardless of side), for correct decisions toward the large-reward side (dark green) and correct decisions toward the small-reward side (light green). (K) Averaged estimates of QC for correct decisions (dark green) versus incorrect decisions (red), both made toward the large-reward side. See Figure S1J for errors toward small-reward side. (L and M) Similar to (J) and (K) but for reward prediction error δ.
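
Below is a compact simulation sketch in the spirit of the model in (G) and (H): a noisy percept of contrast yields a signal-detection confidence, the choice combines that confidence with per-side stored values, and the outcome drives a delta-rule update of the chosen side's value. The noise level, learning rate, reward sizes, and deterministic choice rule are illustrative assumptions, not the fitted model.

    import math
    import random

    SIGMA = 0.15                   # assumed sensory noise (contrast units)
    ALPHA = 0.2                    # assumed learning rate
    values = {"L": 1.0, "R": 1.0}  # stored values of the two actions

    def normal_cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def run_trial(contrast, large_side, rewards=(2.0, 1.0)):
        # contrast > 0 means the stimulus is on the right; rewards = (large, small)
        percept = contrast + random.gauss(0.0, SIGMA)

        # Signal-detection step: probability that the stimulus is on the right
        # given the noisy percept; confidence in the choice follows from it.
        p_right = normal_cdf(percept / SIGMA)
        choice = "R" if values["R"] * p_right >= values["L"] * (1.0 - p_right) else "L"
        confidence = p_right if choice == "R" else 1.0 - p_right

        # Confidence-weighted predicted value of the chosen option (QC in the figure).
        q_c = confidence * values[choice]

        correct = (choice == "R") == (contrast > 0)
        reward = (rewards[0] if choice == large_side else rewards[1]) if correct else 0.0

        delta = reward - q_c              # confidence-graded reward prediction error
        values[choice] += ALPHA * delta   # update only the chosen option
        return choice, q_c, delta
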
Figure 2
Medial Prefrontal Neurons Encode Confidence-Dependent Predicted Value (A) Histological image showing the high-density silicon probe track in mPFC. (B) Raster plot showing spikes of an example mPFC neuron, aligned to the stimulus onset (blue line) with trials sorted by action onset (purple dots). (C) Responses of all task-responsive neurons (n = 316), aligned to the time of stimulus, action, or outcome, sorted according to the time of maximum response in the middle panel. Responses were Z scored and averaged over all stimulus contrasts and possible outcomes. (D) Same as the middle panel of (C) for trials with maximum stimulus contrast with left or right actions. (E) Mean population activity (n = 316 neurons) triggered on action onset for correct choices toward the large-reward side (left), correct choices toward the small-reward side (middle), and incorrect choices toward the large-reward side (right). Responses for incorrect choices toward the small-reward side were smaller (p = 0.015, signed-rank test, data not shown) but such trials were rare. See Figure S2A for responses shown separately for neurons activated or suppressed at the time of the action. See Figure S2B for population activity triggered on outcome onset. (F) The regression analysis estimates a temporal profile for each task event, which in each trial is aligned to the event onset time and scaled by a coefficient. The results are summed to produce predicted traces. (G) The size of action and stimulus profiles for the full regression. Each dot represents one neuron (n = 316). (H) Top: cross-validated explained variance (EV) averaged across neurons (n = 316) for the full regression (dotted line) and for regressions each including only one type of event (bars). Bottom: variance explained by full regression (dotted line) and regressions each excluding one of the events (bars). (I) Predictions of the regression only including action events triggered on action onset, as a function of stimulus contrast and trial type. (J) Average action responses (estimated by regression on mPFC activity) as a function of trial-by-trial decision value QC (estimated from the behavioral model). Trial-by-trial variations in action-related activity (estimated from the regression) correlated better with QC in neurons with positive profile, i.e., activated neurons, compared to neurons with negative profile, i.e., suppressed neurons (Figure S2E, p = 0.011, signed-rank test), consistent with results from averaging across neuronal responses (Figure S2A). (K) Average action responses in correct trials as a function of stimulus contrast and reward size. Circles: mean; error bars: SE across neurons; shaded regions: model estimate of QC. (L) Same as (K) but for correct and error trials to the large-reward side. In (J)–(L), only neurons with significant action profile were included (241/316 neurons). See Figures S2F and S2G for responses of remaining neurons.
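
The kernel regression described in (F) can be sketched as follows, assuming binned activity traces: each task event contributes a temporal profile placed at that event's onset on every trial, and the scaled profiles are summed to predict the trace. Bin size, kernel length, and plain least squares are illustrative choices, not the authors' exact implementation.

    import numpy as np

    def build_design_matrix(n_bins, event_onsets, kernel_len):
        # event_onsets: {event name: list of onset bins}; one column per (event, lag)
        columns = []
        for name in sorted(event_onsets):
            for lag in range(kernel_len):
                col = np.zeros(n_bins)
                for onset in event_onsets[name]:
                    if onset + lag < n_bins:
                        col[onset + lag] = 1.0
                columns.append(col)
        return np.column_stack(columns)

    def fit_event_kernels(trace, design):
        # Least-squares kernel weights; the prediction is the sum of scaled profiles.
        weights, *_ = np.linalg.lstsq(design, trace, rcond=None)
        return weights, design @ weights
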
Figure 3
Dopamine Neurons Encode Confidence-Dependent Predicted Value and Prediction Error (A) Top: schematic of fiber photometry in VTA dopamine neurons. Bottom: example histology showing GCaMP expression and the position of the implanted fiber above the VTA. (B) Task timeline. To allow sufficient time for GCaMP measurement, decisions could be reported only after an auditory go cue. (C) Trial-by-trial dopamine responses from all sessions of an example animal for trials with |contrast| = 0.25, aligned to stimulus onset (dashed line) and sorted by trial type (left column) and outcome time (red, light-green, and dark-green dots). (D) Dopamine responses of an example animal on correct trials as a function of contrast, for stimuli presented on the left or right side of the monitor. (E) Population dopamine responses (n = 5 mice) aligned to the stimulus. (F) Population dopamine responses aligned to the outcome. (G) Top: cross-validated explained variance (EV) averaged across mice for the full regression (dotted line) and for regressions each including only one type of event (bars). Bottom: EV of full regression (dotted line) and regressions each excluding one of the events (bars). (H) Stimulus responses, estimated from regression, as a function of trial-by-trial decision value QC, estimated by the behavioral model. (I) Average stimulus responses in correct trials as a function of stimulus contrast and trial type (error bars: SE across animals); shaded regions: model predictions of QC. (J) Same as (I) but for correct and error trials in which the large-reward side was chosen. (K) Outcome responses, estimated from regression, as a function of trial-by-trial prediction error δ, estimated by the behavioral model. (L and M) Same as (I) and (J) for outcome responses and model estimates of δ. (N) Changes in the proportion of rightward choices as a function of dopamine activity to reward in the previous trial (black and gray: larger and smaller than the 65th percentile, respectively), computed for each level of sensory stimulus in the previous trial (for left and right blocks separately), and then averaged. (O) Changes in the proportion of rightward choices as a function of dopamine activity to reward in the previous trial, computed for each reward size in the previous trial (for left and right blocks separately), and then averaged.
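
One way panels (H) and (K) could be assembled, assuming a dF/F trace sampled in frames, is to take a short post-event window as each trial's response and then bin trials by the model's trial-by-trial variable (QC at the stimulus, δ at the outcome). The window and the number of bins below are illustrative assumptions:

    import numpy as np

    def event_responses(dff, event_frames, window=(0, 15)):
        # One scalar per trial: mean dF/F in a short window after the event frame.
        return np.array([dff[f + window[0]: f + window[1]].mean() for f in event_frames])

    def bin_by_model_variable(responses, model_variable, n_bins=5):
        # Mean response within each quantile bin of the model variable (QC or delta).
        edges = np.quantile(model_variable, np.linspace(0, 1, n_bins + 1))
        bin_index = np.clip(np.digitize(model_variable, edges[1:-1]), 0, n_bins - 1)
        return np.array([responses[bin_index == b].mean() for b in range(n_bins)])
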
Figure 4
Learning Depends on Predicted Value Signaled by Medial Prefrontal Neurons (A) Top: to suppress mPFC population activity, we optogenetically activated Pvalb neurons by directing brief laser pulses through an optical fiber in the prelimbic area (PL). Bottom: example histology showing ChR2 expression in mPFC and the position of the implanted fiber above mPFC. (B) Inactivation occurred for 450 ms following stimulus onset in two different forms: either in 40% of randomly selected trials within blocks with reward-size manipulation (C and D), or in blocks of trials, forming four possible blocks: with or without suppression, and with the large reward on the left or the right (E–G). (C) Reducing QC in the model does not influence ongoing choices. Curves are model predictions for trials with reduced QC (solid) and control trials (dashed). Consistent with the model prediction, suppressing mPFC neurons did not influence performance in current trials. See Figure S4B for similar results in a task with no reward manipulation. (D) Effect of mPFC suppression on psychometric shifts in 5 mice. Data points show the difference in the proportion of rightward choices between the L and R blocks of the control and suppression conditions. Curves illustrate average model fits to the data. Error bars show SE across animals. (E) Reducing QC in the model magnifies the psychometric bias due to the reward size difference. The arrow indicates the difference in the probability of rightward choice, computed at the point where the curves cross zero contrast, between the control condition (dashed) and blocks with reduced QC (solid). Consistent with the model prediction, suppressing mPFC neurons during the task magnified the shifts of psychometric curves due to the reward size difference. The data points show an example animal. (F) Effect of mPFC suppression on psychometric shifts in 6 mice. Curves illustrate average model fits to the data (with reduced QC relative to control). (G) The effect of mPFC suppression on trial-by-trial learning from the onset of the switches in reward contingencies. The shaded areas indicate data (n = 6 mice) in the control (black) and optogenetic suppression (blue) experiments, and curves are average predictions of the model fitted to the data.
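
A small sketch of the logic modeled in (C)–(F): scaling QC down on a trial leaves the already-made choice untouched, but it inflates the outcome-time prediction error and therefore the value update, strengthening learning from that outcome. The scaling factor and learning rate are illustrative assumptions:

    def outcome_update(values, choice, confidence, reward, alpha=0.2, qc_scale=1.0):
        # qc_scale < 1 mimics reduced predicted value under mPFC suppression.
        q_c = qc_scale * confidence * values[choice]
        delta = reward - q_c
        values[choice] += alpha * delta
        return delta

    # The same rewarded trial yields a larger update when QC is reduced:
    control, suppressed = {"R": 1.0}, {"R": 1.0}
    d_control = outcome_update(control, "R", confidence=0.8, reward=1.0)
    d_suppressed = outcome_update(suppressed, "R", confidence=0.8, reward=1.0, qc_scale=0.5)
    # d_suppressed (0.6) exceeds d_control (0.2), so more is learned from that outcome.
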
Figure 5
Learning Depends on Prediction Error but Not Predicted Value Signaled by Dopamine Neurons (A) Left: ChR2 or Arch3 was expressed in dopamine neurons and a fiber was implanted over the VTA. Right: expression of ChR2 or Arch3 in dopamine neurons. (B) In the first experiment, light pulses were delivered at the time of the visual stimulus in blocks of trials, forming four possible blocks (with or without inactivation, with the large reward on the left or right). (C) Behavior of an example animal in the activation trials (filled circles) and control trials (empty circles). Curves are model fits. Error bars are SE across trials. See Figure S5B for population data and Figures S5C–S5G for similar results in a task without reward manipulation or when activation started before the stimulus onset. (D) Manipulation of dopamine responses at the time of outcome: light pulses were delivered following correct decisions toward one response side, which alternated in blocks of 50–350 trials. (E and F) The model-predicted horizontal psychometric curve shift (curves) accounts for dopamine-induced behavioral changes (points). The arrow indicates the difference across blocks in the probability of rightward choice in trials with zero contrast. The psychometric shifts were independent of the hemisphere manipulated (p = 0.36, 2-way ANOVA). See Figures S5H–S5J for similar results across the population and reaction times. (G) Running average of the probability of rightward choice in an example session including 8 blocks (orange and brown). Black: mouse behavior. Purple: model prediction. See Figure S5K for averaged learning curves.
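
If the outcome-time stimulation is assumed to simply add to (ChR2) or subtract from (Arch3) the dopaminergic teaching signal, the block-wise psychometric shift in (E) and (F) follows from the sketch below: the stimulated side's stored value inflates, biasing subsequent choices toward it. The added term and its size are assumptions for illustration, not fitted quantities:

    def outcome_update_with_stimulation(values, choice, q_c, reward,
                                        alpha=0.2, stim_side=None, stim_gain=0.5):
        # Standard delta-rule update plus an assumed extra term when the rewarded
        # choice was toward the stimulated side (use a negative stim_gain for Arch3).
        delta = reward - q_c
        if stim_side is not None and choice == stim_side and reward > 0:
            delta += stim_gain
        values[choice] += alpha * delta
        return values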

