Neural responses in human superior temporal cortex support coding of voice representations

Kyle Rupp et al. PLoS Biol. 2022 Jul 28;20(7):e3001675. doi: 10.1371/journal.pbio.3001675. eCollection 2022 Jul.
Abstract

The ability to recognize abstract features of voice during auditory perception is an intricate feat of human audition. For the listener, this occurs in near-automatic fashion, seamlessly extracting complex cues from a highly variable auditory signal. Voice perception depends on specialized regions of auditory cortex, including superior temporal gyrus (STG) and superior temporal sulcus (STS); however, the nature of voice encoding at the cortical level remains poorly understood. We leverage intracerebral recordings across human auditory cortex during presentation of vocal and nonvocal acoustic stimuli to examine voice encoding in 8 patient-participants undergoing epilepsy surgery evaluation. We show that voice selectivity increases along the auditory hierarchy from the supratemporal plane (STP) to the STG and STS, and that vocalizations can be accurately decoded from human auditory cortical activity even in the complete absence of linguistic content. Neural activity in the STG and STS exhibits an early, less-selective temporal window followed by a sustained, strongly voice-selective window. Encoding models demonstrate a divergence in the encoding of acoustic features along the auditory hierarchy: STG/STS responses are best explained by voice category together with acoustics, rather than by acoustic features alone, whereas STP responses are accounted for by acoustic features. These findings support a model of voice perception that engages categorical encoding mechanisms within STG and STS to facilitate feature extraction.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Auditory-evoked HGA.
(A) Example channel in left HG, patient P7, shown in coronal (upper panel) and axial (lower panel) slices. (B) Auditory-evoked spectral response averaged across all NatS stimuli in the channel from (A). Vertical lines represent stimulus onset and offset, with horizontal lines demarcating frequency boundaries for broadband HGA at 70 and 150 Hz. (C) Mean HGA in the same channel. (D) Auditory responsiveness, quantified as the 2-sample t-value between mean HGA in 500 ms pre- and poststimulus onset windows. Small black dots represent channels with no auditory response, i.e., t-values that failed to reach significance (p < 0.05, FDR corrected). Associated data are located on Zenodo in the Fig 1B and 1C folder (doi: 10.5281/zenodo.6544488). FDR, false discovery rate; HG, Heschl’s gyrus; HGA, high-gamma activity; NatS, Natural Sounds; STG, superior temporal gyrus; STP, supratemporal plane; STS, superior temporal sulcus.
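The responsiveness metric in (D) reduces to a two-sample t-test between per-trial mean HGA in the 500 ms pre- and poststimulus windows, with FDR correction applied across channels. The following is a minimal Python sketch of that computation; the array shapes, the sampling rate, and the helper name auditory_responsiveness are illustrative assumptions, not the authors’ code.

    import numpy as np
    from scipy.stats import ttest_ind
    from statsmodels.stats.multitest import multipletests

    def auditory_responsiveness(hga, fs, onset_idx):
        """t-value comparing mean HGA in 500 ms post- vs. prestimulus windows.

        hga: array of shape (n_trials, n_samples), high-gamma envelope for one channel.
        """
        win = int(0.5 * fs)                                    # 500 ms in samples
        pre = hga[:, onset_idx - win:onset_idx].mean(axis=1)   # per-trial prestimulus means
        post = hga[:, onset_idx:onset_idx + win].mean(axis=1)  # per-trial poststimulus means
        return ttest_ind(post, pre)                            # (t-value, p-value)

    # Example: screen placeholder channels, then FDR-correct across them (p < 0.05, as in Fig 1D)
    rng = np.random.default_rng(0)
    fs, onset_idx = 1000, 1000                                 # assumed 1 kHz sampling, 1 s baseline
    channels = [rng.standard_normal((60, 3000)) for _ in range(16)]
    p_vals = [auditory_responsiveness(ch, fs, onset_idx)[1] for ch in channels]
    reject, _, _, _ = multipletests(p_vals, alpha=0.05, method="fdr_bh")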
Fig 2. Decoding accuracy results.
(A) Full model (i.e., all channels and time windows) decoding accuracy of vocal versus nonvocal for each patient. Dark and light blue bars correspond to NatS results with speech stimuli included or excluded, respectively (e.g., light blue is nonspeech human vocalizations versus nonvocal auditory stimuli). White dots represent statistical significance (p < 0.01, Bonferroni corrected, permutation tests). (B) Sliding window results. Vertical lines represent stimulus offset for the 2 tasks, with horizontal lines showing the fraction of patients with statistically significant decoding in that window (p < 0.001, FDR corrected, cluster-based permutation tests). (C) Cross-task decoding accuracy, with color indicating the training set (white: p < 0.01, red: p < 0.05, Bonferroni corrected, permutation tests). Associated data are located on Zenodo in the Fig 2 folder (doi: 10.5281/zenodo.6544488). FDR, false discovery rate; NatS, Natural Sounds; VL, Voice Localizer.
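The per-patient significance testing in (A) pairs a cross-validated classifier with a label-permutation test. A minimal sketch using scikit-learn’s permutation_test_score is shown below; the logistic-regression decoder and the placeholder feature layout are assumptions for illustration, not the authors’ exact pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, permutation_test_score

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))   # placeholder: (n_trials, n_channels * n_windows) HGA features
    y = rng.integers(0, 2, size=200)     # placeholder labels: 1 = vocal, 0 = nonvocal

    acc, perm_scores, p_value = permutation_test_score(
        LogisticRegression(max_iter=1000), X, y,
        cv=StratifiedKFold(5),
        n_permutations=1000,             # null distribution built from label shuffles
        scoring="accuracy",
    )
    print(f"decoding accuracy = {acc:.3f}, permutation p = {p_value:.4f}")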
Fig 3. Single channel results.
(A) HGA separability between vocal and nonvocal NatS stimuli, across all patients. Channel sizes are proportional to t-statistics comparing auditory response magnitude between 500 ms pre- and poststimulus onset windows, as in Fig 1D. (B) HGA for 2 example channels located in PT (upper panel) and uSTS (lower panel). Black bars show clusters of significantly different timepoints; V–NV separability (panels A, E) is the sum of all clusters for a given channel. Note that while both channels achieve V–NV separability throughout the duration of the stimulus, the magnitude of the nonvocal response differs between the 2 channels, with the NV response of the uSTS channel returning to near baseline after the initial onset window. In contrast, the V response remains elevated in both onset and sustained windows for both the PT and uSTS channels. (C) Mean HGA averaged across 2 different windows: onset (0 to 500 ms) and sustained (500 to 2,000 ms). (D) The HGA ratio is calculated as the difference between vocal and nonvocal responses, relative to their sum. This metric, spanning from −1 to 1, describes a channel’s vocal category preference strength: a value near 1 (or −1) represents a channel that responds only to vocal (or nonvocal) stimuli, while a value of 0 represents equal HGA responses to both stimulus categories. (E) All channels with V–NV separability exhibit onset responses to both stimulus categories: in this early window, HGA ratios reveal that STG and STS (compared to STP) show a slightly diminished response to nonvocal relative to vocal stimuli. During the sustained window, a strong preference for vocal stimuli emerges in STG and STS, while nonvocal responses return to near the silent baseline. Associated data are located on Zenodo in the Fig 3B–3D folder (doi: 10.5281/zenodo.6544488). HGA, high-gamma activity; NatS, Natural Sounds; PT, planum temporale; STG, superior temporal gyrus; STP, supratemporal plane; STS, superior temporal sulcus; uSTS, upper STS; V–NV, vocal–nonvocal.
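The HGA ratio in (D) can be written as (V − NV) / (V + NV), where V and NV are a channel’s mean responses to vocal and nonvocal stimuli within a window. A minimal sketch, with illustrative variable and function names:

    import numpy as np

    def hga_ratio(vocal_hga, nonvocal_hga):
        """Vocal category preference in [-1, 1] for one channel and window.

        Inputs are HGA responses (e.g., per-trial means over the onset or
        sustained window); positive values indicate a vocal preference.
        """
        v, nv = np.mean(vocal_hga), np.mean(nonvocal_hga)
        return (v - nv) / (v + nv)   # 1: vocal only, 0: equal, -1: nonvocal only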
Fig 4. Encoding model results.
Linear regression encoding models suggest that STP is primarily driven by acoustic features, while STG and STS responses are much more influenced by category-like information. Model inputs consisted of both low- and high-level acoustic features such as loudness, MFCCs, spectral flux, and relative formant ratios. Full models also included a binary feature indicating vocal category membership. Likelihood ratio test statistics compare this full model to a nested, acoustic-only model and thus describe the improvement conferred by V–NV class information. Well-fit channels in STP are modeled best by acoustic features throughout both the onset and sustained windows. Meanwhile, STG and STS channels also perform well and benefit from the addition of category-level information, with a slight skew toward the later sustained window. STG, superior temporal gyrus; STP, supratemporal plane; STS, superior temporal sulcus; V–NV, vocal–nonvocal.
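The nested-model comparison can be sketched as follows: fit an acoustic-only ordinary least squares model and a full model that appends the binary vocal-category regressor, then form the likelihood ratio statistic, which for Gaussian linear models reduces to n·ln(RSS_reduced / RSS_full) with a chi-squared reference distribution. The helper below is an illustrative assumption, not the authors’ implementation; their exact features and fitting procedure may differ.

    import numpy as np
    from scipy.stats import chi2

    def likelihood_ratio_test(X_acoustic, X_full, y):
        """Compare nested OLS encoding models; X_full = X_acoustic plus a vocal flag."""
        def rss(X):
            X1 = np.column_stack([np.ones(len(y)), X])     # add intercept column
            beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # ordinary least squares fit
            resid = y - X1 @ beta
            return resid @ resid
        n = len(y)
        stat = n * np.log(rss(X_acoustic) / rss(X_full))   # LR statistic for Gaussian errors
        df = X_full.shape[1] - X_acoustic.shape[1]         # here: 1 added category feature
        return stat, chi2.sf(stat, df)                     # (statistic, p-value)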

Comment in

  • The path of voices in our brain.
    Morillon B, Arnal LH, Belin P. PLoS Biol. 2022 Jul 29;20(7):e3001742. doi: 10.1371/journal.pbio.3001742. eCollection 2022 Jul. PMID: 35905075. Free PMC article.
