J Neurosci. 2020 Mar 18;40(12):2553-2561. doi: 10.1523/JNEUROSCI.2355-19.2020. Epub 2020 Feb 14.

Primate Orbitofrontal Cortex Codes Information Relevant for Managing Explore-Exploit Tradeoffs

Vincent D Costa et al. J Neurosci.

Erratum in

Abstract

Reinforcement learning (RL) refers to the behavioral process of learning to obtain reward and avoid punishment. An important component of RL is managing explore-exploit tradeoffs: the problem of choosing between exploiting options with known values and exploring unfamiliar options. We examined correlates of this tradeoff, as well as other RL-related variables, in orbitofrontal cortex (OFC) while three male monkeys performed a three-armed bandit learning task. During the task, novel choice options periodically replaced familiar options. The values of the novel options were unknown, and the monkeys had to explore them to determine whether they were better than the other currently available options. The identity of the chosen stimulus and the reward outcome were strongly encoded in the responses of single OFC neurons. These two variables define the states and state transitions in our model that are relevant to decision-making. The chosen value of the option and the relative value of exploring that option were encoded at intermediate levels. We also found that OFC value coding was stimulus specific, as opposed to coding value independent of the identity of the option. The location of the option and the value of the current environment were encoded at low levels. Therefore, we found encoding of the variables relevant to learning and managing explore-exploit tradeoffs in OFC. These results are consistent with findings in the ventral striatum and amygdala and show that this monosynaptically connected network plays an important role in learning based on the immediate and future consequences of choices.

SIGNIFICANCE STATEMENT: Orbitofrontal cortex (OFC) has been implicated in representing the expected values of choices. Here we extend these results and show that OFC also encodes information relevant to managing explore-exploit tradeoffs. Specifically, OFC encodes an exploration bonus, which characterizes the relative value of exploring novel choice options. OFC also strongly encodes the identity of the chosen stimulus and reward outcomes, which are necessary for computing the values of novel and familiar options.
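The explore-exploit tradeoff described in the abstract can be illustrated with a minimal sketch. This is not the authors' POMDP model: the square-root exploration bonus, the bonus_weight parameter, and the 0.5 prior for unseen options are illustrative assumptions, chosen only to show how a bonus that shrinks with familiarity temporarily over-values novel options.

```python
import math

def choose(counts, rewards, total_trials, bonus_weight=1.0):
    """Pick an option by estimated value plus an exploration bonus.

    counts[i]  -- times option i has been chosen
    rewards[i] -- rewards earned from option i
    The bonus shrinks as an option becomes familiar, so a novel
    (rarely sampled) option is temporarily over-valued and explored.
    """
    scores = []
    for n, r in zip(counts, rewards):
        value = r / n if n > 0 else 0.5  # assumed prior for an unseen option
        bonus = bonus_weight * math.sqrt(math.log(total_trials + 1) / (n + 1))
        scores.append(value + bonus)
    return max(range(len(scores)), key=scores.__getitem__)
```

With these assumptions, a never-sampled option (count 0) earns the largest bonus and is chosen even when a familiar option already has a high estimated value, mirroring the transient preference for novel options shown in the behavior.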

Keywords: decision-making; explore–exploit; monkey; orbitofrontal cortex; reinforcement learning.


Figures

Figure 1.
Task and recording locations. A, Structure of an individual trial in the three-arm bandit task, in which monkeys indicated their choice by making a saccade to one of three options. Following each choice, the monkeys received either a fixed amount of juice reward, with a probability conditioned on the stimulus, or no reward. B, Each block of 650 trials began with the presentation of three novel images. This set, s, of visual choice options was repeatedly presented to the monkey for a series of 10–30 trials. On a randomly selected trial between 10 and 30, one of the existing options was randomly selected and replaced with a novel image. This formed a new set of options, which was presented for a series of 10–30 trials. Novel options were randomly assigned their own reward probabilities (0.2, 0.5, or 0.8). Configurations in which all three options had the same reward probability were not allowed. This process of introducing a novel option to create a new set was repeated 32 times within a block. C, MRI-guided reconstruction of recording locations. Coronal T1-weighted MRIs, acquired with electrodes lowered to specific depths, were used to verify the trajectories and placement of the recording electrodes in each monkey. The number of cells recorded in the OFC at each site was projected onto template views from a standard macaque brain atlas.
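The block structure in Figure 1B can be sketched as a schedule generator, assuming only the details given in the caption (reward probabilities drawn from {0.2, 0.5, 0.8}, runs of 10–30 trials, all-same configurations disallowed, 32 sets per block). The function name, seeding, and data layout are illustrative choices, not from the paper.

```python
import random

def make_block(n_sets=32, probs=(0.2, 0.5, 0.8), seed=0):
    """Generate the option sets for one block of the three-arm bandit.

    Returns a list of (reward_probabilities, run_length) pairs. The block
    starts with three novel options; after each run of 10-30 trials, one
    randomly chosen option is replaced by a novel one with its own randomly
    assigned reward probability. Configurations in which all three options
    share the same probability are re-drawn.
    """
    rng = random.Random(seed)

    def fresh():
        p = [rng.choice(probs) for _ in range(3)]
        while len(set(p)) == 1:        # disallow all-same configurations
            p = [rng.choice(probs) for _ in range(3)]
        return p

    current = fresh()
    sets = [(list(current), rng.randint(10, 30))]
    for _ in range(n_sets - 1):
        i = rng.randrange(3)           # option to replace with a novel image
        current[i] = rng.choice(probs)
        while len(set(current)) == 1:  # re-draw if all three now match
            current[i] = rng.choice(probs)
        sets.append((list(current), rng.randint(10, 30)))
    return sets
```

Note that a "novel" option may be assigned the same reward probability as the option it replaced; what makes it novel is the unfamiliar image, which this sketch does not model.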
Figure 2.
Behavior and model. A, Fraction of times the monkeys chose the novel option as a function of the number of trials since the novel option was introduced. Data are plotted separately for novel options assigned reward probabilities of 0.2, 0.5, and 0.8. B, Fraction of times the monkeys chose the novel option, relative to the best and worst alternative familiar options, as a function of trials since the novel option was introduced. Best and worst were defined as the options with the highest and lowest IEV, not including the novel option. C, Fraction of times the novel option was selected, relative to the best and worst familiar options, as a function of the IEV of the best alternative option. D, Average IEV of novel options estimated by the POMDP model, as a function of trials since the introduction of a novel option. Data are plotted separately for options with IEVs of 0.8, 0.5, and 0.2. Note that these are model value estimates, not behavioral choices. The IEV should asymptotically approach the true value of the choice, which is 0.8, 0.5, or 0.2. E, Average bonus of the novel option, relative to the best and worst familiar options, as a function of trials since the introduction of the novel option. F, Average FEV of the chosen option, as a function of the IEV of the best option currently available. G–I, Predicted versus measured choice probabilities for individual Monkeys H, F, and N, respectively. These values are for all choices, not just the novel options. Correlations between predicted and measured choice probabilities: G, r = 0.48 ± 0.079 (N = 9 sessions); H, r = 0.29 ± 0.052 (N = 23 sessions); I, r = 0.60 ± 0.033 (N = 42 sessions).
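The asymptotic approach of the IEV estimates to the true reward probability in Figure 2D can be illustrated with a simple beta-Bernoulli estimator. This is a stand-in sketch, not the authors' POMDP model: the Beta(1, 1) prior and the function name iev_trajectory are assumptions for illustration only.

```python
import random

def iev_trajectory(true_p, n_trials, seed=0):
    """Posterior-mean reward estimate for one option over repeated choices.

    Uses a Beta(1, 1) prior on the reward probability; the posterior mean
    (successes + 1) / (trials + 2) plays the role of an immediate expected
    value (IEV) estimate and approaches true_p as experience accumulates.
    """
    rng = random.Random(seed)
    wins = 0
    estimates = []
    for t in range(1, n_trials + 1):
        wins += rng.random() < true_p  # simulated Bernoulli reward outcome
        estimates.append((wins + 1) / (t + 2))
    return estimates
```

Starting from the prior mean of 0.5, the estimate converges toward the assigned value (0.8, 0.5, or 0.2), which is the qualitative pattern the panel shows for the model's IEV.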
Figure 3.
Task-relevant neural encoding. The fraction of the population of neurons that significantly (p < 0.05) encoded task-relevant variables (top) and the associated effect sizes (bottom, ω²; Olejnik and Algina, 2000). A sliding-window ANOVA was performed on spikes counted in 200 ms bins, advanced in 50 ms increments. Effect size estimates were calculated for all neurons showing a significant effect in at least 5 windows from 0 to 1200 ms after cue or reward onset. Bar plots indicate the fraction of significant neurons encoding each task factor for each of the three monkeys. The shading of each bar reflects the number of neurons recorded in each monkey (Monkey H = 12, Monkey F = 63, Monkey N = 71 neurons). A, IEV of the chosen option. B, Exploration bonus associated with the chosen option. C, FEV of the chosen option. D, Stimulus-specific identity of the chosen option. E, Outcome of the current (blue) and previous (orange) trial. F, Screen location of the chosen option.
Figure 4.
Single neuron example of variability in stimulus encoding. A, Average response to chosen stimulus in a window from 0 to 400 ms after cue onset. B, Spike density functions showing average response to the 10 different high-valued stimuli. Each line represents the mean for a different stimulus.
Figure 5.
Alternative encoding model. All variables assessed at p < 0.05. A, Encoding of IEV as an interaction (IEVx) with chosen stimulus identity, and stimulus as a non-nested main effect (Stim). B, Overlay of the non-nested interaction from the second ANOVA model and the nested stimulus identity encoding from the first ANOVA model.

