Commun Biol. 2024 Jun 11;7(1):711. doi: 10.1038/s42003-024-06372-6.

Cortical-striatal brain network distinguishes deepfake from real speaker identity

Claudia Roswandowitz et al.

Abstract

Deepfakes are viral ingredients of digital environments, and they can trick human cognition into misperceiving the fake as real. Here, we test the neurocognitive sensitivity of 25 participants to accept or reject person identities as recreated in audio deepfakes. We generate high-quality voice identity clones from natural speakers by using advanced deepfake technologies. During an identity matching task, participants show intermediate performance with deepfake voices, indicating levels of deception and resistance to deepfake identity spoofing. On the brain level, univariate and multivariate analyses consistently reveal a central cortico-striatal network that decodes the vocal acoustic pattern and deepfake level (auditory cortex), as well as natural speaker identities (nucleus accumbens), which are valued for their social relevance. This network is embedded in a broader neural identity and object recognition network. Humans can thus be partly tricked by deepfakes, but the neurocognitive mechanisms identified during deepfake processing open windows for strengthening human resilience to fake information.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of the deepfake voice synthesis, acoustic representation of identity-encoding features, and experimental tasks.
a The deepfake synthesis consisted of the acoustic voice feature extraction of the natural target and source speakers, training of a Gaussian mixture model (GMM), and a conversion step that combined the synthesized idiosyncratic acoustic voice profile of the target speaker with the natural speech sound of the source speaker. b Distribution of identity-encoding voice features in natural and deepfake voices. We scaled acoustic values to facilitate visualization and model output comparisons. LMMs assessed acoustic differences between natural and deepfake voices. Asterisks indicate *p < .05, Bonferroni-corrected for n = 7 models; ns: not significant. Circles indicate sentence-specific acoustic values with speaker-specific color coding. Horizontal lines indicate the mean values. c Experimental design of the fMRI matching task, including an identity and a speech task. d Accuracy in the fMRI matching task. Statistics based on an LMM (n_observations = 100, n_participants = 25) with task and sound condition as fixed effects and participant as a random factor. Asterisks indicate **p < .0001, *p < .001. Circles indicate individual data and horizontal lines mean performances.
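The pipeline in panel (a) corresponds to the classic joint-density GMM approach to voice conversion: fit a GMM on stacked source/target feature frames, then map source frames toward the target speaker via the conditional expectation E[y | x]. Below is a minimal sketch under stated assumptions: it presumes parallel, time-aligned feature frames (feature extraction, DTW alignment, and waveform re-synthesis are omitted), and scikit-learn/SciPy are illustrative tool choices, not the authors' actual toolchain; all function names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_conversion_gmm(src_feats, tgt_feats, n_components=32):
    """Fit a joint-density GMM on stacked source/target frames.

    src_feats, tgt_feats: (n_frames, d) arrays of time-aligned
    spectral features (e.g. mel-cepstra) from parallel utterances.
    """
    joint = np.hstack([src_feats, tgt_feats])  # (n_frames, 2d)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=0).fit(joint)

def convert_frames(gmm, src_feats):
    """Map source frames toward the target speaker via E[y | x]."""
    n, d = src_feats.shape
    K = gmm.n_components
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    cov_xx = gmm.covariances_[:, :d, :d]
    cov_yx = gmm.covariances_[:, d:, :d]

    # Component responsibilities p(k | x), computed from the
    # x-marginal of each mixture component.
    log_p = np.stack(
        [multivariate_normal.logpdf(src_feats, mu_x[k], cov_xx[k])
         for k in range(K)], axis=1) + np.log(gmm.weights_)
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)

    # Responsibility-weighted conditional means:
    # E[y | x, k] = mu_y[k] + cov_yx[k] @ inv(cov_xx[k]) @ (x - mu_x[k])
    converted = np.zeros((n, d))
    for k in range(K):
        A = cov_yx[k] @ np.linalg.inv(cov_xx[k])
        converted += resp[:, [k]] * (mu_y[k] + (src_feats - mu_x[k]) @ A.T)
    return converted
```

The converted frames would then be fed to a vocoder together with the source prosody, which is what yields the "target identity, source speech sound" combination the caption describes.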
Fig. 2. Brain responses and functional networks for natural and deepfake speaker identities.
a Neural activity patterns for the contrast [IDnat > IDdf]. The white dashed line indicates the voice-sensitive regions evoked by the functional voice localizer scan. Second-level group t-maps, p < 0.005 corrected at the cluster level k > 47. b Beta estimates in the right NAcc, right mid STG, and left mid STG for the conditions included in the interaction contrast ([IDnat > IDdf] > [SPEECHnat > SPEECHdf]). Plots show individual parameter estimates (n = 25) extracted from the maximum statistic for the contrasts shown in (a). Asterisks indicate significant effects of LMMs, *p < .001. We ran LMMs to specify the interaction effects. Circles indicate individual data and horizontal lines mean values. c Functional connectivity networks supporting natural identity matching (second-level group t-maps, gPPI, p < 0.005 corrected at the cluster level k > 47) from the contrast [IDnat > baseline]; neural seeds as highlighted in (a). d Functional connectivity patterns with higher and lower connectivity for [IDnat > IDdf] and [IDdf > IDnat] (second-level group t-maps, gPPI, p < 0.005 corrected at the cluster level k > 47); neural seeds as highlighted in (a). Cd: caudate nucleus, FG: fusiform gyrus, HC: hippocampus, IFG: inferior frontal gyrus (orb: pars orbitalis, tri: pars triangularis), ITG: inferior temporal gyrus, L: left, LOC: lateral occipital cortex, MTG: middle temporal gyrus, NAcc: nucleus accumbens, OFG: occipital fusiform gyrus, PCG: posterior cingulate gyrus, PrCG: precentral gyrus, PT: planum temporale, R: right, SMC: supplementary motor cortex, SMG: supramarginal gyrus, STG: superior temporal gyrus, TP: temporal pole.
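The thresholding reported throughout this caption (second-level group t-maps at voxel-wise p < 0.005 with a cluster extent of k > 47) can be reproduced in outline with nilearn. The caption does not name the analysis software, so this is a sketch under assumptions: the per-participant contrast file names are placeholders, and nilearn is an illustrative substitute for whatever package the authors used.

```python
import pandas as pd
from nilearn.glm.second_level import SecondLevelModel
from nilearn.glm import threshold_stats_img

# Hypothetical first-level [IDnat > IDdf] contrast images, one per
# participant (n = 25 per the caption).
contrast_imgs = [f"sub-{i:02d}_IDnat_gt_IDdf.nii.gz" for i in range(1, 26)]

# One-sample (intercept-only) group design.
design = pd.DataFrame({"intercept": [1] * len(contrast_imgs)})

model = SecondLevelModel().fit(contrast_imgs, design_matrix=design)
tmap = model.compute_contrast("intercept",
                              second_level_stat_type="t",
                              output_type="stat")

# Voxel-wise height threshold p < .005 combined with a 47-voxel
# cluster extent, mirroring "p < 0.005 corrected at the cluster
# level k > 47".
thresholded_map, threshold = threshold_stats_img(
    tmap, alpha=0.005, height_control="fpr", cluster_threshold=47)
```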
Fig. 3. Multivariate neural decoding of natural and deepfake speaker identities.
Multivariate decoding accuracies revealed by group-averaged confusion matrices comparing the frequency of the predicted sound class with the true sound class for a the right NAcc, b, c six subregions of the bilateral AC, d the IFG pars triangularis, and e the TP. We anatomically defined the region-of-interest (ROI) maps. Colored frames indicate decoding accuracies significantly above chance (n = 24, chance level 25%, one-sample t-test, p < 0.001, Bonferroni-corrected for the number of subregions per ROI).
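The decoding scheme this caption describes can be sketched as follows: per-ROI cross-validated classification of the four sound classes, a row-normalised confusion matrix per participant, and a one-sample t-test of the diagonal accuracies against the 25% chance level with Bonferroni correction. The classifier, cross-validation scheme, and data shapes are assumptions for illustration; only the chance level, sample size, and correction logic come from the caption.

```python
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def roi_confusion(X, y, n_classes=4):
    """Row-normalised confusion matrix for one participant and ROI.

    X: (n_trials, n_voxels) activity patterns; y: sound-class labels.
    """
    pred = cross_val_predict(LinearSVC(), X, y, cv=5)
    cm = confusion_matrix(y, pred, labels=list(range(n_classes)))
    return cm / cm.sum(axis=1, keepdims=True)

def above_chance(diag_accs, chance=0.25, n_tests=6):
    """Test per-class decoding accuracy against the chance level.

    diag_accs: (n_participants, n_classes) diagonal accuracies,
    one row per subject (n = 24 in the caption). p-values are
    Bonferroni-corrected over the subregions of an ROI.
    """
    t, p = ttest_1samp(diag_accs, popmean=chance,
                       axis=0, alternative="greater")
    return t, np.minimum(p * n_tests, 1.0)
```

Averaging the per-participant matrices from roi_confusion across subjects yields the group-averaged confusion matrices shown in the figure.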
Fig. 4. Social perception encoded in AC activity.
Scatterplots show the association (n = 25, regression models, p < 0.05) between brain activity differences and measures of social perception. Circles indicate individual data and the gray area shows ± s.e.m. a Association of voice naturalness ratings with bilateral AC activity. AC activity for [IDdf > IDnat] increased the less natural deepfake voices (NATdf) were rated relative to natural voices (NATnat). b Association of likability ratings with left AC activity. AC activity for [IDdf > IDnat] increased the less likable deepfake voices (LIKdf) were rated relative to natural voices (LIKnat). For raw rating data, see Supplementary Table 11A.
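The brain-behaviour association here reduces to a simple per-participant regression: one activity-difference score against one rating-difference score per subject. A minimal sketch with simulated stand-in data (all values are hypothetical; only n = 25 and the use of difference scores follow the caption):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n = 25  # participants, per the caption

# Hypothetical per-participant differences (deepfake minus natural):
rating_diff = rng.normal(size=n)  # e.g. NATdf - NATnat naturalness ratings
ac_diff = 0.4 * rating_diff + rng.normal(scale=0.6, size=n)  # [IDdf > IDnat] betas

# Simple linear regression of activity difference on rating difference.
res = linregress(rating_diff, ac_diff)
print(f"slope = {res.slope:.2f}, r = {res.rvalue:.2f}, p = {res.pvalue:.3f}")
```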
