[Preprint]. 2024 Dec 26. arXiv:2411.03630v2.

RTify: Aligning Deep Neural Networks with Human Behavioral Decisions


Yu-Ang Cheng et al. arXiv.

Abstract

Current neural network models of primate vision focus on replicating overall levels of behavioral accuracy, often neglecting the rich, dynamic nature of perceptual decisions. Here, we introduce a novel computational framework that models the dynamics of human behavioral choices by learning to align the temporal dynamics of a recurrent neural network (RNN) with human reaction times (RTs). We describe an approximation that allows us to constrain the number of time steps an RNN takes to solve a task using human RTs. The approach is evaluated extensively against a range of psychophysics experiments. We also show that the approximation can be used to optimize an "ideal-observer" RNN model to achieve an optimal tradeoff between speed and accuracy without any human data; the resulting model nonetheless accounts well for human RT data. Finally, we use the approximation to train a deep learning implementation of the popular Wong-Wang decision-making model. The model is integrated with a convolutional neural network (CNN) model of visual processing and evaluated on both artificial and natural image stimuli. Overall, we present a novel framework that helps align current vision models with human behavior, bringing us closer to an integrated model of human vision.

Figures

Figure 1: Illustration of our RTify method.
The input is a visual stimulus represented by random moving dots, but the model can also accommodate color images and video sequences. We take a pretrained task-optimized RNN and use a trainable function f_w to transform the activity of the network into a real-valued evidence measure, e_t, that is integrated over time by an evidence accumulator, Φ_t. When the accumulated evidence reaches the threshold θ, processing stops and a decision is made. The time step τ_θ at which the accumulated evidence crosses this threshold is taken as the model RT for that stimulus.
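The stopping rule in the caption above can be sketched in a few lines. This is a minimal illustrative NumPy version, not the paper's implementation: it assumes a simple linear readout standing in for the trainable function f_w, and it returns the first threshold-crossing step as the model RT (falling back to the last step if the threshold is never reached).

```python
import numpy as np

def rtify_decision(hidden_states, w, theta):
    """Accumulate a scalar evidence signal over RNN time steps.

    hidden_states: (T, H) array of RNN activities, one row per time step.
    w: (H,) weights of a hypothetical linear readout standing in for f_w.
    theta: decision threshold on the accumulated evidence Phi_t.
    Returns (rt, phi): the 1-based step at which Phi_t first reaches theta
    (the model RT), and the full accumulated-evidence trace.
    """
    evidence = hidden_states @ w      # e_t = f_w(h_t), here a linear map
    phi = np.cumsum(evidence)         # Phi_t = e_1 + ... + e_t
    crossed = np.nonzero(phi >= theta)[0]
    # If the threshold is never reached, respond at the final time step.
    rt = int(crossed[0]) + 1 if crossed.size else len(phi)
    return rt, phi
```

In the paper this stopping time is made differentiable via an approximation so that the readout can be trained against human RTs; the hard `>=` comparison above is only the inference-time behavior.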
Figure 2: RTified model evaluation on a RDM task [24].
Human data are shown as a gray shaded area, and model fits are shown for (A) the “supervised” setting, where human behavioral responses are used to train the models, and (B) the “self-penalized” setting, where no human data are used. Our approach (green) outperforms the two alternative approaches (brown), i.e., entropy-thresholding [29] for the “supervised” setting and uncertainty proxy [30] for the “self-penalized” setting (see Fig. 4 for MSE comparisons and Fig. S3 for all coherences).
Figure 3: Illustration of RTifying feedforward neural networks.
We develop a multi-class compatible and fully differentiable RNN module based on the WW model [21, 22]. This module is implemented as an attractor-based RNN and is stacked on top of a feedforward neural network. The feedforward network first takes an image as input; outputs from its classification units are then sent to the RTified WW module (A). Information is accumulated by multiple populations of neurons in the RTified WW module while they compete with one another (B). A decision is made, and the process stops, when one of the populations reaches a threshold. The number of time steps the RTified WW module needs to reach the threshold is used to predict the human RT (C).
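The competition-to-threshold dynamic in panels (A)-(C) can be sketched as a simple race between populations. This NumPy toy is an illustrative Wong-Wang-style sketch, not the paper's module: the self-excitation and inhibition constants, step size, and rectified update rule are all placeholder assumptions.

```python
import numpy as np

def ww_race(class_logits, theta=1.0, dt=0.1,
            self_exc=0.2, inhib=0.1, max_steps=200):
    """Toy race between K populations, one per class.

    class_logits: (K,) feedforward class evidence (e.g. CNN outputs).
    Each population is driven by its own input plus self-excitation,
    minus inhibition from the other populations. The first population
    to reach theta wins; the step count plays the role of the model RT.
    Returns (choice, n_steps). All dynamics parameters are illustrative.
    """
    s = np.zeros(len(class_logits))           # population activities
    for t in range(1, max_steps + 1):
        drive = class_logits + self_exc * s - inhib * (s.sum() - s)
        s = np.maximum(s + dt * drive, 0.0)   # rectified accumulation
        if s.max() >= theta:                  # threshold crossed: decide
            return int(np.argmax(s)), t
    return int(np.argmax(s)), max_steps       # timeout: pick the leader
```

A basic sanity property of such a race is that stronger evidence yields both the correct choice and a shorter RT, which is the qualitative behavior the RTified WW module is trained to match quantitatively.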
Figure 4: MSE comparisons for the RDM task [24] at all coherence levels.
(A) The RTified model trained in the “supervised” setting (i.e., with human behavioral responses; green solid line) performs better (lower MSE) than entropy-thresholding [29] (brown solid line) at all coherence levels. Similarly, the RTified model trained in the “self-penalized” setting (i.e., without human data; green dashed line) performs better than uncertainty proxy [30] (brown dashed line). With the help of our RTified WW module (orange solid line), a convolutional neural network (C3D) also fits the data better than entropy-thresholding [29]. (B) Classification accuracy comparisons between pretrained and RTified models for the RDM task [24]. The RTified models trained in the “supervised” setting (green solid line) and the “self-penalized” setting (green dashed line) achieve human-like classification accuracy at all coherence levels, unlike the pretrained model without RTify (green dotted line). With the help of our RTified WW module (orange solid line), a CNN (C3D) matches human accuracy better than the pretrained model without RTify (orange dotted line).
Figure 5: RTified model evaluation on an object categorization task [39].
Model vs. human RT predictions for our RTified model (green) vs. alternative approaches (brown), (A) in the “supervised” setting, where human behavioral responses are used to train the model, and (B) in the “self-penalized” setting, where no human data are used. Solid lines are linear regression fits between model and human RTs. Cross-shaded areas and dashed lines are controls showing the fits after removing the highest model RTs. Our approach outperforms the two alternative approaches, i.e., entropy-thresholding [29] for the “supervised” setting and uncertainty proxy [30] for the “self-penalized” setting.
Figure 6: RTified WW model evaluation.
We combine our RTified WW module with (A) a 3D CNN to fit human RTs collected in an RDM task [24] (see Fig. 4 for MSE comparisons with other methods) and (B) a VGG to fit human RTs in a rapid object categorization task [39]. Cross-shaded areas and dashed lines are controls showing the fits after removing the highest model RTs.


References

    1. Oliva A., Torralba A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42(3) (2001) 145–175
    2. Hubel D.H., Wiesel T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology 160(1) (1962) 106
    3. Itti L., Koch C., Niebur E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11) (1998) 1254–1259
    4. Jagadeesh A.V., Gardner J.L.: Texture-like representation of objects in human visual cortex. Proceedings of the National Academy of Sciences 119(17) (2022) e2115302119
    5. Doshi F.R., Konkle T., Alvarez G.A.: A feedforward mechanism for human-like contour integration. bioRxiv (2024)
