Imaging Neurosci (Camb). 2025 Apr 8;3:imag_a_00525. doi: 10.1162/imag_a_00525. eCollection 2025.

Alignment of auditory artificial networks with massive individual fMRI brain data leads to generalisable improvements in brain encoding and downstream tasks


Maëlle Freteault et al. Imaging Neurosci (Camb).

Abstract

Artificial neural networks trained in the field of artificial intelligence (AI) have emerged as key tools to model brain processes, sparking the idea of aligning network representations with brain dynamics to enhance performance on AI tasks. While this concept has gained support in the visual domain, we investigate here the feasibility of creating auditory artificial neural models directly aligned with individual brain activity. This objective raises major computational challenges, as models have to be trained directly with brain data, which is typically collected at a much smaller scale than data used to train AI models. We aimed to answer two key questions: (1) Can brain alignment of auditory models lead to improved brain encoding for novel, previously unseen stimuli? (2) Can brain alignment lead to generalisable representations of auditory signals that are useful for solving a variety of complex auditory tasks? To answer these questions, we relied on two massive datasets: a deep phenotyping dataset from the Courtois neuronal modelling project, where six subjects watched four seasons (36 h) of the Friends TV series in functional magnetic resonance imaging, and the HEAR benchmark, a large battery of downstream auditory tasks. We fine-tuned SoundNet, a small pretrained convolutional neural network with ~2.5 M parameters. Aligning SoundNet with brain data from three seasons of Friends led to substantial improvement in brain encoding in the fourth season, extending beyond auditory and visual cortices. We also observed consistent performance gains on the HEAR benchmark, particularly for tasks with limited training data, where brain-aligned models performed comparably with the best-performing models regardless of size. We finally compared individual and group models, finding that individual models often matched or outperformed group models in both brain encoding and downstream task performance, highlighting the data efficiency of fine-tuning with individual brain data. Our results demonstrate the feasibility of aligning artificial neural network representations with individual brain activity during auditory processing, and suggest that this alignment is particularly beneficial for tasks with limited training data. Future research is needed to establish whether larger models can achieve even better performance and whether the observed gains extend to other tasks, particularly in the context of few-shot learning.

Keywords: artificial neural networks; auditory neuroscience; deep phenotyping datasets; downstream generalisation; functional magnetic resonance imaging (fMRI); individual-specific computational models.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1.
Overview of the analysis. In this study, we used a naturalistic fMRI dataset to align internal features of a pretrained network with brain signals, using an AI training technique called fine-tuning. We evaluated how brain alignment changed the performance of the network, both for tasks the network was originally trained and optimised for (within distribution) and for new tasks (out of distribution).
Fig. 2.
Overview of the training framework. We provided the audio track of the TV show Friends to a pretrained convolutional network, SoundNet. Initially, we extracted the output from the 7th convolutional layer of SoundNet, with its parameters frozen (fixed values that remain unchanged during training), and used this output as input to train a final encoding layer to predict fMRI activity from a subject watching the TV show. This model serves as our baseline. In a second phase, we partially retrained SoundNet along with the encoding layer by fine-tuning all parameters up to the selected layer, allowing these parameters to be updated during training. This new model, where internal layers are fine-tuned to better align with cerebral activity, is referred to as the brain-aligned model. The results presented here were obtained using the model fine-tuned up to convolutional layer 4, as depicted in this figure, but we also tested models fine-tuned at various depths, ranging from Conv7 to Conv1.
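To make the two training phases concrete, here is a minimal PyTorch sketch of the freeze-then-fine-tune scheme described in this caption. The backbone class, channel sizes, and the choice of which blocks to unfreeze (Conv4 through Conv7, one reading of "fine-tuned up to Conv4") are illustrative assumptions, not the authors' actual code.

    import torch
    import torch.nn as nn

    # Hypothetical SoundNet-like backbone: a stack of 1-D convolutional blocks.
    # The real SoundNet has 7 convolutional layers; channel sizes here are illustrative.
    class SoundNetBackbone(nn.Module):
        def __init__(self):
            super().__init__()
            self.convs = nn.ModuleList(
                [nn.Conv1d(1 if i == 0 else 32, 32, kernel_size=8, stride=2) for i in range(7)]
            )

        def forward(self, wav):                      # wav: (batch, 1, n_samples)
            x = wav
            for conv in self.convs:
                x = torch.relu(conv(x))
            return x                                 # features from the 7th convolutional layer

    backbone = SoundNetBackbone()
    n_parcels = 210                                  # MIST ROI parcellation used in the study
    encoding_head = nn.Conv1d(32, n_parcels, kernel_size=1)   # predicts one time series per parcel

    # Phase 1 (baseline): freeze the whole backbone, train only the encoding head.
    for p in backbone.parameters():
        p.requires_grad = False
    optimiser_baseline = torch.optim.Adam(encoding_head.parameters(), lr=1e-4)

    # Phase 2 (brain alignment): additionally unfreeze the blocks from Conv4 onwards
    # (assumed reading of "fine-tuned up to Conv4") and update them jointly with the head.
    for conv in backbone.convs[3:]:
        for p in conv.parameters():
            p.requires_grad = True
    trainable = [p for p in list(backbone.parameters()) + list(encoding_head.parameters())
                 if p.requires_grad]
    optimiser_aligned = torch.optim.Adam(trainable, lr=1e-4)

Temporal resampling of the feature time series to the fMRI sampling rate and the regression loss against the recorded BOLD signal are omitted here; they are not specified in this excerpt.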
Fig. 3.
Distribution of the proportion of labelled audio in all half-episodes across Friends seasons. The proportion of labelled audio for each half-episode was obtained using a ResNet22 pretrained on AudioSet. Pairs of seasons with a significantly different distribution of labelled audio proportion are indicated with an asterisk (p < 0.05).
Fig. 4.
Full-brain encoding using SoundNet with no fine-tuning. Surface maps for each subject, showing the r² value for all ROIs of the MIST ROI parcellation. Only parcels with r² values significantly higher than those of a null model initialised with random weights are shown (Wilcoxon test, FDR q < 0.05). The regions with the highest r² scores are the STG bilaterally, yet significant brain encoding is achieved throughout most of the cortex, with relatively high values found in the visual cortex as well.
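A hedged sketch of how such a parcel-wise comparison against a null model could be run, assuming per-half-episode r² scores for the trained and randomly initialised models (array names, shapes, and values below are illustrative, not the authors' data):

    import numpy as np
    from scipy.stats import wilcoxon
    from statsmodels.stats.multitest import multipletests

    # Illustrative arrays: r² per test half-episode (rows) and per MIST parcel (columns).
    rng = np.random.default_rng(0)
    r2_model = rng.normal(0.05, 0.02, size=(48, 210))   # brain-encoding model
    r2_null = rng.normal(0.00, 0.02, size=(48, 210))    # null model with random weights

    # One-sided Wilcoxon signed-rank test per parcel: is the model better than the null?
    p_values = np.array([
        wilcoxon(r2_model[:, roi], r2_null[:, roi], alternative="greater").pvalue
        for roi in range(r2_model.shape[1])
    ])

    # FDR correction (Benjamini-Hochberg) at q < 0.05; only significant parcels would be plotted.
    significant, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
    print(f"{significant.sum()} / {len(significant)} parcels significant after FDR")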
Fig. 5.
STG encoding using SoundNet with no fine-tuning and fMRI data with spatial smoothing. Mapping of the r² scores from 556 voxels inside the cerebral region defined as the Middle STG by the MIST ROI parcellation, computed by the individual baseline model. To better represent the STG, four slices were selected for each subject, two from the left hemisphere (-63 and -57) and two from the right hemisphere (63 and 57). Only voxels with r² values significantly higher than those of a null model initialised with random weights are shown (Wilcoxon test, FDR q < 0.05). The individual anatomical T1 image is used as background.
Fig. 6.
Individual impact of brain-aligned SoundNet on full-brain encoding. For each subject, left: surface maps of the r² scores computed with each individual Conv4 model, for the 210 ROIs of the MIST ROI parcellation. Coloured ROIs have an r² score significantly greater than the null model (Wilcoxon test, FDR q < 0.05). Right: surface maps of the percentage difference in r² scores for each ROI between the individual Conv4 and baseline models. Only ROIs where the Conv4 model has an r² score exceeding ±0.05 and significantly greater or lesser than the baseline model are displayed (Wilcoxon test, FDR q < 0.05).
Fig. 7.
STG encoding using brain-aligned SoundNet and fMRI data with spatial smoothing. For each subject, top: mapping of the r² scores from 556 voxels inside the cerebral region defined as the Middle STG by the MIST ROI parcellation, computed by the individual Conv4 model. Only voxels with r² values significantly higher than those of a null model initialised with random weights are shown (Wilcoxon test, FDR q < 0.05). For each subject, bottom: mapping of the difference in r² scores between the Conv4 model and the baseline model. Only voxels from the Conv4 model with r² values exceeding ±0.05 and significantly greater or lesser than those of the baseline model are shown (Wilcoxon test, FDR q < 0.05). The individual anatomical T1 image is used as background.
Fig. 8.
Comparison of prediction accuracy for subject-specific fMRI data using models trained on the same versus other subjects' data. We computed the difference in r² scores between a brain-aligned model trained on data from the same subject as the test data and models trained either on another subject's data (blue to brown) or on a group of individual data (pink) different from the subject used for testing. The difference is computed for each of the 48 half-episodes of the fourth season of Friends. A Wilcoxon test was used to determine whether the difference between each individual model and the group model, as well as each of the other five individual models, was significant (p < 0.05).
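For one target subject, this comparison could look like the following sketch, assuming per-half-episode r² scores from the matching individual model and from a model trained on another subject (or the group); all array names and values are illustrative:

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(1)
    n_half_episodes = 48                                     # season 4 test set

    # Illustrative r² scores on the target subject's fMRI data.
    r2_same_subject = rng.normal(0.12, 0.03, n_half_episodes)  # model trained on the same subject
    r2_other_model = rng.normal(0.10, 0.03, n_half_episodes)   # model trained on another subject or the group

    # Per-half-episode difference and a Wilcoxon signed-rank test on that difference (p < 0.05).
    diff = r2_same_subject - r2_other_model
    stat, p = wilcoxon(r2_same_subject, r2_other_model)
    print(f"Median difference: {np.median(diff):.3f}, Wilcoxon p = {p:.3g}")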
Fig. 9.
Rank variation between Conv4 and baseline models on all tasks from the HEAR benchmark. Adaptation of figure 2 of appendix B from the original HEAR paper (Turian et al., 2022), showing a similarity visualisation of all 19 tasks, based upon normalised scores. For each task, the change of rank between the baseline model and the Conv4 model is symbolised by a coloured circle. Performance from both whole-brain and STG versions of the individual models (half-circle on the left) and group models (half-circle on the right) has been averaged for each of the 19 tasks from the HEAR benchmark. When the change of rank is equal to +1 (light yellow), the Conv4 model performs better than SoundNet on the task but does not outperform other models. Significance was tested using a Wilcoxon test (p < 0.05).
Fig. 10.
Rank variation between whole-brain and middle STG models on all tasks from the HEAR benchmark. Adaptation of figure 2 of appendix B from the original HEAR paper (Turian et al., 2022), showing a similarity visualisation of all 19 tasks, based upon normalised scores. For each task, the change of rank between the baseline model and the Conv4 model is symbolised by a coloured circle. Left: average change of rank with the whole-brain models (six models for half a circle). Right: average change of rank with the STG models (six models for half a circle). Due to the low number of models per task, significance for each task has not been tested at this level.
Fig. 11.
Rank variation between Conv4 and baseline models on all tasks from the HEAR benchmark, ordered by dataset size. Each individual Conv4 model (both whole-brain and Middle STG models) was used to solve the 19 tasks from the HEAR benchmark, ordered by the size of the training dataset available through the benchmark. We extracted from the official HEAR Benchmark Leaderboard the performances of 8 small models (up to 12 M parameters) and 21 large models (from 22 to 1,339 M parameters). We compared the performances of our brain-aligned and baseline models against those of the large models (L columns, on the right side for each subject) and the small models (S columns, on the left side for each subject). For each task, the change of rank between the baseline model and the Conv4 model is symbolised by a coloured circle.
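A minimal sketch of how a rank change of this kind could be computed for a single task, assuming a list of leaderboard scores plus scores for the baseline and Conv4 models (all numbers below are illustrative, not the reported results):

    def rank_among(score, other_scores):
        """1-based rank of `score` when inserted among `other_scores` (higher score = better rank)."""
        return 1 + sum(s > score for s in other_scores)

    # Illustrative leaderboard scores for one HEAR task (e.g. the small models, up to 12 M parameters).
    leaderboard = [0.81, 0.78, 0.74, 0.71, 0.69, 0.65, 0.60, 0.55]

    baseline_score = 0.63       # pretrained SoundNet + encoding head
    conv4_score = 0.70          # brain-aligned model fine-tuned up to Conv4

    rank_change = rank_among(baseline_score, leaderboard) - rank_among(conv4_score, leaderboard)
    print(f"Change of rank: {rank_change:+d}")   # e.g. +1 means the Conv4 model climbs one position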

References

    1. Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., & Varoquaux, G. (2014). Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 14. 10.3389/fninf.2014.00014
    2. Allen, E. J., St-Yves, G., Wu, Y., Breedlove, J. L., Prince, J. S., Dowdle, L. T., Nau, M., Caron, B., Pestilli, F., Charest, I., Hutchinson, J. B., Naselaris, T., & Kay, K. (2022). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116–126. 10.1038/s41593-021-00962-x
    3. Arandjelovic, R., & Zisserman, A. (2017). Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy (pp. 609–617). IEEE. 10.1109/ICCV.2017.73
    4. Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. Advances in Neural Information Processing Systems, 29, 892–900. 10.48550/arXiv.1610.09001
    5. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460. 10.48550/arXiv.2006.11477
