Integrating Reward Information for Prospective Behavior

Sam Hall-McMaster^{1,2}, Mark G Stokes^{3,2}, Nicholas E Myers^{3,2}

Affiliations

¹ Department of Experimental Psychology, University of Oxford, United Kingdom, OX2 6GG. Correspondence: hall-mcmaster@mpib-berlin.mpg.de
² Wellcome Centre for Integrative Neuroimaging, University of Oxford, United Kingdom, OX3 9DU.
³ Department of Experimental Psychology, University of Oxford, United Kingdom, OX2 6GG.

PMID: 35042770
PMCID: PMC8896545
DOI: 10.1523/JNEUROSCI.1113-21.2021

Sam Hall-McMaster et al. J Neurosci. 2022 Mar 2;42(9):1804-1819. doi: 10.1523/JNEUROSCI.1113-21.2021. Epub 2022 Jan 18.

Abstract

Value-based decision-making is often studied in a static context, where participants decide which option to select from those currently available. However, everyday life often involves an additional dimension: deciding when to select to maximize reward. Recent evidence suggests that agents track the latent reward of an option, updating changes in their latent reward estimate, to achieve appropriate selection timing (latent reward tracking). However, this strategy can be difficult to distinguish from one in which the optimal selection time is estimated in advance, allowing an agent to wait a predetermined amount of time before selecting, without needing to monitor an option's latent reward (distance-to-goal tracking). Here, we show that these strategies can in principle be dissociated. Human brain activity was recorded using electroencephalography (EEG), while female and male participants performed a novel decision task. Participants were shown an option and decided when to select it, as its latent reward changed from trial to trial. While the latent reward was uncued, it could be estimated using cued information about the option's starting value and value growth rate. We then used representational similarity analysis (RSA) to assess whether EEG signals more closely resembled latent reward tracking or distance-to-goal tracking. This approach successfully dissociated the strategies in this task. Starting value and growth rate were translated into a distance-to-goal signal, far in advance of selecting the option. Latent reward could not be independently decoded. These results demonstrate the feasibility of using high temporal resolution neural recordings to identify internally computed decision variables in the human brain.

SIGNIFICANCE STATEMENT: Reward-seeking behavior involves acting at the right time. However, the external world does not always tell us when an action is most rewarding, necessitating internal representations that guide action timing. Specifying this internal neural representation is challenging because it might stem from a variety of strategies, many of which make similar predictions about brain activity. This study used a novel approach to test whether alternative strategies could be dissociated in principle. Using representational similarity analysis (RSA), we were able to distinguish between candidate internal representations for selection timing. This shows how pattern analysis methods can be used to measure latent decision information in noninvasive neural data.

Keywords: decision-making; pattern analysis; reward maximization; selection timing.

Figures

**Figure 1.**
A, B, Task design. Participants were presented with an option that had a specific starting value (wedge size) and growth rate (color). A, Participants could choose to let the option's reward increase, by selecting the blank side of the screen during the response phase. The option reward was then updated by the growth rate and the participant performed another trial with the same item. Importantly, the same starting value and growth rate information were presented for each trial within a run, although the underlying option reward was changing. B, Participants could choose to end the trial run by selecting the side of the screen with the option during the response phase. Participants then received the current option reward and started a new trial run, with a new option. Option rewards increased linearly up to their maximum value (500 points), after which value decayed exponentially if participants continued to make wait responses. Trial runs lasted up to a maximum of 16 trials. C, Option properties. Options could have one of eight starting values, ranging from 0 to 350 points. Starting value was indicated by the size of a wedge shown on screen (with wedge size oriented randomly on each trial run). The option could have one of five growth rates, which determined the increase in an option's latent reward per wait response. Growth rates included 35, 40, 50, 60, and 80 points per wait trial, which were selected to maximize variability in the optimal select trial from a starting value of 0. Two colors were used per growth rate for each participant. The two color sets were used interchangeably during the task. One trial run could use a color from set A and the next could use a color from set B, or vice versa. D, Regression coefficients showing the influence of starting value and growth rate on the timing of option selection (trial on which the option was selected within the trial run). 
The lower, middle, and upper horizontal bars within each plot indicate the 25th percentile, the median, and the 75th percentile, respectively. Colored circles within each box show the mean across participants and vertical lines extending from these circles show the SEM. Vertical whiskers extending from each box indicate the most extreme upper and lower values within 1.5 times the interquartile range. Values outside this were deemed outliers and are indicated with a + symbol; *** above each condition indicates regression coefficients are significantly different from zero at p < 0.001. E, F, Summary of selection timing for each starting value-growth rate combination. E, The optimal trial on which to select each option to get the maximum number of points. F, The median trial on which participants selected each option. G, The trial on which participants selected the option (y-axis) as a function of the optimal trial to select the option to gain maximum reward (x-axis). The black diagonal provides a reference line for an optimal agent and the purple line shows mean participant data. Shading around the purple line shows the SEM. H, I, Summary of performance biases for each starting value-growth rate combination. H, The difference between the optimal and empirical selection timing (the action bias). I, The difference between the median number of points earned and the maximum number that could be earned (the reward loss).
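The reward dynamics described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 500-point maximum, the 16-trial limit, and the growth rates come from the caption, while the decay factor `DECAY`, the clipping of growth at the maximum, and all function names are hypothetical choices made for the example.

```python
MAX_REWARD = 500   # maximum option reward (from the caption)
MAX_TRIALS = 16    # maximum number of trials in a run (from the caption)
DECAY = 0.8        # ASSUMPTION: per-wait decay factor once the peak is reached


def latent_reward(start, growth, waits, decay=DECAY):
    """Reward on offer after `waits` wait responses."""
    reward = float(start)
    peaked = reward >= MAX_REWARD
    for _ in range(waits):
        if peaked:
            # Past the peak, each further wait shrinks the reward.
            reward *= decay
        else:
            # Linear growth; ASSUMPTION: growth is clipped at the maximum.
            reward = min(reward + growth, MAX_REWARD)
            peaked = reward >= MAX_REWARD
    return reward


def optimal_select_trial(start, growth):
    """1-based trial within the run on which selecting yields the most points."""
    trials = range(1, MAX_TRIALS + 1)
    # max() returns the earliest trial when several trials tie.
    return max(trials, key=lambda t: latent_reward(start, growth, t - 1))


def distance_to_goal(start, growth, trial):
    """Number of trials from the current trial to the optimal selection trial."""
    return optimal_select_trial(start, growth) - trial


if __name__ == "__main__":
    for growth in (35, 40, 50, 60, 80):
        print(growth, optimal_select_trial(0, growth))
```

Under these assumptions, the five growth rates give five different optimal selection trials from a starting value of 0, which is in the spirit of the caption's remark that the rates were chosen to spread out the optimal select trial.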

**Figure 2.**
Neural decoding of presented and latent variables. A, Average decoding strength for starting value information (wedge size) during the decision phase (0–800 ms), on the first three trials within a run. B, Time course for starting value decoding for the first trial within a run (0–2000 ms from trial onset). C, D, Abstract growth rate decoding, independent of stimulus color due to cross-decoding across color sets. C, Average decoding strength for abstract growth rate information during the decision phase (0–800 ms), on the first three trials within a run. D, Time course for abstract growth rate decoding on the first trial within a run (0–2000 ms from trial onset). E, Average decoding strength for latent reward, which is the starting value plus the product of growth rate $\times$ trials waited so far in the run. This is shown during the decision phase (0–800 ms) on the second and third trials in a run. Latent reward is not shown on trial 1 because it would be equivalent to starting value. F, Time course for latent reward decoding on the second trial within a run (0–2000 ms). G, Average decoding strength for distance-to-goal information (i.e., the number of trials from the current trial to the optimal selection trial) during the decision phase (0–800 ms), for the first three trials within a run. H, Time course for distance-to-goal decoding on the second trial within a run (0–2000 ms from trial onset). ***I–L***, Multistage RSA of latent task variables. The cross-validated decoding in panels ***A–H*** does not control for the influence of other task variables, which could drive the decoding results. The multistage RSA in ***I–L*** regresses out neural activity related to other task variables, before decoding the variable of interest. This provides a measure of latent variable decoding, independent from other task variables. 
The pattern alignment measure refers to how much the neural dissimilarity matrix can be predicted from a model dissimilarity matrix that contains expected condition differences. I, J, Pattern alignment for latent reward encoding after starting value (wedge size), growth rate (color), and distance-to-goal have been regressed out of the neural dissimilarity matrix obtained from the EEG signal. I, Average pattern alignment for latent reward encoding during the decision phase (0–800 ms) for the second and third trials within a trial run. J, Pattern alignment time course for latent reward encoding on the second trial within a run (0–2000 ms). K, L, Pattern alignment for distance-to-goal encoding when starting value, growth rate, and latent reward have been regressed out of the EEG signal. K, Average pattern alignment for distance-to-goal during the decision phase (0–800 ms), for the first three trials within a run. L, Pattern alignment time course for distance-to-goal encoding on the second trial within a run (0–2000 ms). A, C, E, G, I, K, The lower, middle, and upper horizontal bars within each box indicate the 25th percentile, the median, and the 75th percentile, respectively. Colored circles within each box show the mean across participants, and vertical lines extending from these circles show the SEM. Vertical whiskers extending from each box indicate the most extreme upper and lower values within 1.5 times the interquartile range. Values outside this were deemed outliers and are indicated with a + symbol. Asterisk symbols above each condition indicate decoding coefficients are significantly different from zero at ***p < 0.001, **p < 0.01, *p < 0.05. Asterisk symbols above a bar that bridges two conditions indicate a significant difference between conditions at ***p < 0.001, **p < 0.01, *p < 0.05. Asterisk symbols are based on Bonferroni-Holm (BH) corrected p-values. 
B, D, F, H, J, L, Vertical lines show onset of the decision phase, delay, and response phase, respectively. Shading in the decoding time courses shows the SEM. Solid colored lines under time courses indicate when significant decoding is observed (cluster corrected p-values < 0.05).
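The multistage RSA idea in the Figure 2 caption (regress out other task variables, then measure how well the remaining neural dissimilarity is predicted by the model of interest) can be illustrated with a small standard-library sketch. It is not the authors' pipeline: the absolute-difference model dissimilarities, the ordinary least-squares steps, the toy condition values, and every function name are assumptions made for the example.

```python
def model_rdm(values):
    """Upper-triangle model dissimilarities: |difference| between condition values."""
    n = len(values)
    return [abs(values[i] - values[j]) for i in range(n) for j in range(i + 1, n)]


def center(vec):
    mean = sum(vec) / len(vec)
    return [x - mean for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residualize(target, nuisances):
    """Stage 1: remove what the nuisance vectors (plus a constant) explain linearly."""
    basis = []
    for vec in nuisances:
        vec = center(vec)
        for b in basis:  # Gram-Schmidt, so the joint fit is removed, not a sequential one
            coef = dot(vec, b) / dot(b, b)
            vec = [x - coef * y for x, y in zip(vec, b)]
        if dot(vec, vec) > 1e-12:
            basis.append(vec)
    resid = center(target)
    for b in basis:
        coef = dot(resid, b) / dot(b, b)
        resid = [x - coef * y for x, y in zip(resid, b)]
    return resid


def pattern_alignment(neural_rdm, target_rdm, nuisance_rdms):
    """Stage 2: slope of the residual neural dissimilarities on the target model."""
    resid = residualize(neural_rdm, nuisance_rdms)
    target = center(target_rdm)
    return dot(resid, target) / dot(target, target)


# Toy conditions (hypothetical values): latent reward on trial 2 is
# starting value + growth rate x 1 wait, as defined in the caption.
starts = [0, 50, 100, 150]
growths = [35, 80, 50, 40]
latents = [s + g * 1 for s, g in zip(starts, growths)]

start_rdm, growth_rdm, latent_rdm = map(model_rdm, (starts, growths, latents))

# Simulated neural dissimilarities that carry starting value plus latent reward.
neural_rdm = [s + 0.5 * l for s, l in zip(start_rdm, latent_rdm)]
print(pattern_alignment(neural_rdm, latent_rdm, [start_rdm, growth_rdm]))
```

If the simulated neural dissimilarities were built from the nuisance models alone, the same call would return a value at numerical zero, which is the sense in which the residual measure is independent of the other task variables.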

**Figure 3.**
Neural decoding of empirical selection timing. A, Decoding time course for the number of wait trials until a select response is made (the distance-to-select). B, Decoding time course for the difference between the option's current latent reward and its latent reward when selected (the reward-to-select). C, Decoding time course for the decision to select versus the decision to wait on the previous trial. ***A–C***, Vertical lines show onset of the decision phase, delay, and response phase, respectively. Shading around the principal line in the decoding time courses shows the SEM. Solid colored lines under time courses indicate when significant decoding is observed (cluster corrected p-values < 0.05).

**Figure 4.**
***A–C***, Top three performing models of selection timing. A, A cost of waiting model, which includes a reward cost proportional to the number of steps waited. B, A model that includes a cost of waiting as well as biased representations of the starting value and growth rate. C, A model that includes a cost of waiting and a biased representation of the growth rate. ***A–C***, Each plot shows the number of trials participants waited before selecting the option (purple line) as a function of the optimal number of trials to wait before selecting the option. The black diagonal provides a reference line for an optimal agent. The remaining colored line shows the cross-validated performance of the model fit to participant data. Shading around each colored line shows the SEM. D, The difference in mean squared error (MSE) between a given model and the cost of waiting model. More negative values indicate worse model performance, relative to the cost of waiting model. The y-axis shows each model being compared with the cost of waiting model. The x-axis shows the difference in MSE. Mean differences across the sample are shown with dots enclosed in a black circle and 95% confidence intervals are shown with black lines extending from the sample means. MSE differences for individual participants are shown for the top competitor models as colored dots and the remaining competitor models as gray dots. E, Correlation between the average distance-to-goal coding during the decision phase (0–800 ms) and the cost of waiting parameter estimates. Black lines indicate linear fits to the data and gray lines indicate 95% confidence intervals of the fits.
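A cost of waiting model of the kind named in the Figure 4 caption can be sketched as follows. The caption only says the cost is proportional to the number of steps waited; how that cost enters the decision here (subtracted from the reward, with selection at the best net value), the clipping of reward at 500 points, and the names `select_trial_with_wait_cost` and `mse` are assumptions for illustration, not the fitted model.

```python
def select_trial_with_wait_cost(rewards, cost):
    """1-based trial that maximizes reward minus a cost per wait already made.

    rewards[k] is the reward on offer after k wait responses.
    """
    net = [reward - cost * waits for waits, reward in enumerate(rewards)]
    return net.index(max(net)) + 1  # earliest trial wins ties


def mse(predicted, observed):
    """Mean squared error between predicted and observed selection trials."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


# Example option: starting value 0, growth rate 35, 16 trials in the run.
# ASSUMPTION: reward is clipped at the 500-point maximum; within these
# 16 trials this option never enters the decay phase.
rewards = [min(0 + 35 * waits, 500) for waits in range(16)]

print(select_trial_with_wait_cost(rewards, cost=0))   # no cost: wait for the peak
print(select_trial_with_wait_cost(rewards, cost=25))  # moderate cost: select earlier
print(select_trial_with_wait_cost(rewards, cost=40))  # cost above growth: select at once
```

In this toy setting a moderate cost shifts selection one trial earlier than optimal, because the final clipped step adds fewer points than the wait costs. Comparing such predictions with observed selection trials through `mse` mirrors the model comparison in panel D.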
