Commun Biol. 2024 Jun 11;7(1):711. doi: 10.1038/s42003-024-06372-6.

Cortical-striatal brain network distinguishes deepfake from real speaker identity

Claudia Roswandowitz et al.

Abstract

Deepfakes are viral ingredients of digital environments, and they can trick human cognition into misperceiving the fake as real. Here, we test the neurocognitive sensitivity of 25 participants to accept or reject person identities as recreated in audio deepfakes. We generate high-quality voice identity clones from natural speakers by using advanced deepfake technologies. During an identity matching task, participants show intermediate performance with deepfake voices, indicating levels of deception and resistance to deepfake identity spoofing. On the brain level, univariate and multivariate analyses consistently reveal a central cortico-striatal network that decodes the vocal acoustic pattern and deepfake level (auditory cortex), as well as natural speaker identities (nucleus accumbens), which are valued for their social relevance. This network is embedded in a broader neural identity and object recognition network. Humans can thus be partly tricked by deepfakes, but the neurocognitive mechanisms identified during deepfake processing open windows for strengthening human resilience to fake information.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of the deepfake voice synthesis, acoustic representation of identity-encoding features, and experimental tasks.
a The deepfake synthesis consisted of the acoustic voice feature extraction of the natural target and source speakers, training of a Gaussian mixture model (GMM), and a conversion step that combined the synthesized idiosyncratic acoustic voice profile of the target speaker with the natural speech sound of the source speaker. b Distribution of identity-encoding voice features in natural and deepfake voices. We scaled acoustic values to facilitate visualization and model output comparisons. LMMs assessed acoustic differences between natural and deepfake voices. Asterisks indicate *p < .05, Bonferroni-corrected for n = 7 models; ns: not significant. Circles indicate sentence-specific acoustic values with speaker-specific color coding. Horizontal lines indicate the mean values. c Experimental design of the fMRI matching task, including an identity and a speech task. d Accuracy in the fMRI matching task. Statistics based on an LMM (n_observations = 100, n_participants = 25) with task and sound condition as fixed effects and participant as a random factor. Asterisks indicate **p < .0001, *p < .001. Circles indicate individual data and horizontal lines mean performances.
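The pipeline in panel (a) corresponds to the classic joint-density GMM approach to voice conversion: fit a GMM on stacked source/target feature frames, then map source frames toward the target speaker via the conditional expectation E[y | x]. Below is a minimal sketch under stated assumptions: it presumes parallel, time-aligned feature frames (feature extraction, DTW alignment, and waveform re-synthesis are omitted), and scikit-learn/SciPy are illustrative tool choices, not the authors' actual toolchain; all function names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_conversion_gmm(src_feats, tgt_feats, n_components=32):
    """Fit a joint-density GMM on stacked source/target frames.

    src_feats, tgt_feats: (n_frames, d) arrays of time-aligned
    spectral features (e.g. mel-cepstra) from parallel utterances.
    """
    joint = np.hstack([src_feats, tgt_feats])  # (n_frames, 2d)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=0).fit(joint)

def convert_frames(gmm, src_feats):
    """Map source frames toward the target speaker via E[y | x]."""
    n, d = src_feats.shape
    K = gmm.n_components
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    cov_xx = gmm.covariances_[:, :d, :d]
    cov_yx = gmm.covariances_[:, d:, :d]

    # Component responsibilities p(k | x), computed from the
    # x-marginal of each mixture component.
    log_p = np.stack(
        [multivariate_normal.logpdf(src_feats, mu_x[k], cov_xx[k])
         for k in range(K)], axis=1) + np.log(gmm.weights_)
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)

    # Responsibility-weighted conditional means:
    # E[y | x, k] = mu_y[k] + cov_yx[k] @ inv(cov_xx[k]) @ (x - mu_x[k])
    converted = np.zeros((n, d))
    for k in range(K):
        A = cov_yx[k] @ np.linalg.inv(cov_xx[k])
        converted += resp[:, [k]] * (mu_y[k] + (src_feats - mu_x[k]) @ A.T)
    return converted
```

The converted frames would then be fed to a vocoder together with the source prosody, which is what yields the "target identity, source speech sound" combination the caption describes.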
Fig. 2. Brain responses and functional networks for natural and deepfake speaker identities.
a Neural activity patterns for the contrast [IDnat > IDdf]. The white dashed line indicates the voice-sensitive regions evoked by the functional voice localizer scan. Second-level group t-maps, p < 0.005 corrected at the cluster level k > 47. b Beta estimates in the right NAcc, right mid STG, and left mid STG for the conditions included in the interaction contrast ([IDnat > IDdf] > [SPEECHnat > SPEECHdf]). Plots show individual parameter estimates (n = 25) extracted from the maximum statistic for the contrasts shown in (a). Asterisks indicate significant effects of LMMs, *p < .001. We ran LMMs to specify the interaction effects. Circles indicate individual data and horizontal lines mean values. c Functional connectivity networks supporting natural identity matching (second-level group t-maps, gPPI, p < 0.005 corrected at the cluster level k > 47) from the contrast [IDnat > baseline]; neural seeds as highlighted in (a). d Functional connectivity patterns with higher and lower connectivity for [IDnat > IDdf] and [IDdf > IDnat] (second-level group t-maps, gPPI, p < 0.005 corrected at the cluster level k > 47); neural seeds as highlighted in (a). Cd: caudate nucleus, FG: fusiform gyrus, HC: hippocampus, IFG: inferior frontal gyrus (orb: pars orbitalis, tri: pars triangularis), ITG: inferior temporal gyrus, L: left, LOC: lateral occipital cortex, MTG: middle temporal gyrus, NAcc: nucleus accumbens, OFG: occipital fusiform gyrus, PCG: posterior cingulate gyrus, PrCG: precentral gyrus, PT: planum temporale, R: right, SMC: supplementary motor cortex, SMG: supramarginal gyrus, STG: superior temporal gyrus, TP: temporal pole.
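The thresholding reported throughout this caption (second-level group t-maps at voxel-wise p < 0.005 with a cluster extent of k > 47) can be reproduced in outline with nilearn. The caption does not name the analysis software, so this is a sketch under assumptions: the per-participant contrast file names are placeholders, and nilearn is an illustrative substitute for whatever package the authors used.

```python
import pandas as pd
from nilearn.glm.second_level import SecondLevelModel
from nilearn.glm import threshold_stats_img

# Hypothetical first-level [IDnat > IDdf] contrast images, one per
# participant (n = 25 per the caption).
contrast_imgs = [f"sub-{i:02d}_IDnat_gt_IDdf.nii.gz" for i in range(1, 26)]

# One-sample (intercept-only) group design.
design = pd.DataFrame({"intercept": [1] * len(contrast_imgs)})

model = SecondLevelModel().fit(contrast_imgs, design_matrix=design)
tmap = model.compute_contrast("intercept",
                              second_level_stat_type="t",
                              output_type="stat")

# Voxel-wise height threshold p < .005 combined with a 47-voxel
# cluster extent, mirroring "p < 0.005 corrected at the cluster
# level k > 47".
thresholded_map, threshold = threshold_stats_img(
    tmap, alpha=0.005, height_control="fpr", cluster_threshold=47)
```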
Fig. 3. Multivariate neural decoding of natural and deepfake speaker identities.
Multivariate decoding accuracies revealed by group-averaged confusion matrices comparing the frequency of the predicted sound class with the true sound class for a the right NAcc, b, c six subregions of the bilateral AC, d the IFG pars triangularis, and e the TP. We anatomically defined the region-of-interest (ROI) maps. Colored frames indicate decoding accuracies significantly above chance (n = 24, chance level 25%, one-sample t-test, p < 0.001, Bonferroni-corrected for the number of subregions per ROI).
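The decoding scheme this caption describes can be sketched as follows: per-ROI cross-validated classification of the four sound classes, a row-normalised confusion matrix per participant, and a one-sample t-test of the diagonal accuracies against the 25% chance level with Bonferroni correction. The classifier, cross-validation scheme, and data shapes are assumptions for illustration; only the chance level, sample size, and correction logic come from the caption.

```python
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def roi_confusion(X, y, n_classes=4):
    """Row-normalised confusion matrix for one participant and ROI.

    X: (n_trials, n_voxels) activity patterns; y: sound-class labels.
    """
    pred = cross_val_predict(LinearSVC(), X, y, cv=5)
    cm = confusion_matrix(y, pred, labels=list(range(n_classes)))
    return cm / cm.sum(axis=1, keepdims=True)

def above_chance(diag_accs, chance=0.25, n_tests=6):
    """Test per-class decoding accuracy against the chance level.

    diag_accs: (n_participants, n_classes) diagonal accuracies,
    one row per subject (n = 24 in the caption). p-values are
    Bonferroni-corrected over the subregions of an ROI.
    """
    t, p = ttest_1samp(diag_accs, popmean=chance,
                       axis=0, alternative="greater")
    return t, np.minimum(p * n_tests, 1.0)
```

Averaging the per-participant matrices from roi_confusion across subjects yields the group-averaged confusion matrices shown in the figure.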
Fig. 4. Social perception encoded in AC activity.
Scatterplots show the association (n = 25, regression models, p < 0.05) between brain activity differences and measures of social perception. Circles indicate individual data and the gray area shows ± s.e.m. a Association of voice naturalness ratings with bilateral AC activity. AC activity for [IDdf > IDnat] increased the less natural deepfake voices (NATdf) were rated relative to natural voices (NATnat). b Association of likability ratings with left AC activity. AC activity for [IDdf > IDnat] increased the less likable deepfake voices (LIKdf) were rated relative to natural voices (LIKnat). For raw rating data, see Supplementary Table 11A.
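The brain-behaviour association here reduces to a simple per-participant regression: one activity-difference score against one rating-difference score per subject. A minimal sketch with simulated stand-in data (all values are hypothetical; only n = 25 and the use of difference scores follow the caption):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n = 25  # participants, per the caption

# Hypothetical per-participant differences (deepfake minus natural):
rating_diff = rng.normal(size=n)  # e.g. NATdf - NATnat naturalness ratings
ac_diff = 0.4 * rating_diff + rng.normal(scale=0.6, size=n)  # [IDdf > IDnat] betas

# Simple linear regression of activity difference on rating difference.
res = linregress(rating_diff, ac_diff)
print(f"slope = {res.slope:.2f}, r = {res.rvalue:.2f}, p = {res.pvalue:.3f}")
```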
