Review

Annu Rev Psychol. 2022 Jan 4;73:79-102. doi: 10.1146/annurev-psych-022321-035256. Epub 2021 Oct 21.

Speech Computations of the Human Superior Temporal Gyrus

Ilina Bhaya-Grossman et al.

Abstract

Human speech perception results from neural computations that transform external acoustic speech signals into internal representations of words. The superior temporal gyrus (STG) contains the nonprimary auditory cortex and is a critical locus for phonological processing. Here, we describe how speech sound representation in the STG relies on fundamentally nonlinear and dynamical processes, such as categorization, normalization, contextual restoration, and the extraction of temporal structure. A spatial mosaic of local cortical sites on the STG exhibits complex auditory encoding for distinct acoustic-phonetic and prosodic features. We propose that as a population ensemble, these distributed patterns of neural activity give rise to abstract, higher-order phonemic and syllabic representations that support speech perception. This review presents a multi-scale, recurrent model of phonological processing in the STG, highlighting the critical interface between auditory and language systems.

Keywords: categorization; contextual restoration; phonological processing; superior temporal gyrus; temporal landmarks.


Figures

Figure 1
Within- and between-speaker variability pose a challenge to speech comprehension. (a) A higher-pitch speaker produces two instances of “bat” slightly differently (labeled as utterance 1 and utterance 2), but both speech sequences map onto the same linguistic content. Key within-speaker differences in the speech waveform and spectrogram representation of the acoustic signal include changes in the amplitude of the speech envelope, shifted spectral peaks, and different final phoneme durations. (b) The same speaker as in panel a produces the word “mat.” Corresponding acoustic-phonetic features are shown in the lowest panel, indicating the manner and place of the articulatory gesture that produces the corresponding sound. (c) A different, lower-pitch speaker than the speakers in panels a and b produces the word “bat.” Key between-speaker differences in the speech waveform and spectrogram representation of the acoustic signal include changes in the amplitude of the speech envelope and shifted spectral peaks. Between-speaker variability can be due to several specific speaker characteristics, such as the length of the speaker’s vocal tract, speaking rate, and accent.
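As an illustrative sketch (not part of the review), the two acoustic representations named in this caption, the amplitude envelope and the spectrogram, can be computed directly from a waveform. The signal below is synthetic and the window sizes are arbitrary choices; in practice a recorded utterance of “bat” would be loaded instead.

```python
# Hedged sketch: amplitude envelope and spectrogram of a speech-like signal.
# The waveform, formant frequencies, and smoothing window are assumptions.
import numpy as np
from scipy.signal import hilbert, spectrogram

fs = 16000                                   # sampling rate (Hz)
t = np.arange(0, 0.5, 1 / fs)                # 500 ms "utterance"
# Two formant-like spectral peaks riding on a rising-falling amplitude contour
x = np.sin(2 * np.pi * 700 * t) + 0.6 * np.sin(2 * np.pi * 1700 * t)
x *= np.hanning(len(t))                      # crude amplitude contour

# Amplitude envelope: magnitude of the analytic signal, smoothed over ~10 ms
envelope = np.abs(hilbert(x))
envelope = np.convolve(envelope, np.ones(160) / 160, mode="same")

# Spectrogram: reveals the spectral peaks that shift within and between speakers
freqs, times, sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=300)

print("envelope peak:", envelope.max())
print("dominant frequency bin:", freqs[sxx.mean(axis=1).argmax()], "Hz")
```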
Figure 2
ECoG enables high-resolution recording of neural activity in the nonprimary auditory cortex. (a) This panel illustrates the anatomical boundary of the STG. The color gradient represents the functionally differentiated posterior and middle regions of the STG (Ozker et al. 2017, Yi et al. 2019, Hamilton et al. 2020). (b) Example sentences from the TIMIT corpus are shown at the top, where time from the most recent sentence onset is marked (Garofolo et al. 1993). Single electrode activity is aligned to the onset of speech and averaged across all corpus sentences. The cortical responses to the speech stimulus across the STG reveal a wide array of response profiles, even between responses recorded 4–8 millimeters apart (a slow, sustained cortical response for the electrode labeled E1 and a rapid response to sentence onset for the electrode labeled E2). Abbreviations: ECoG, electrocorticogram; mSTG, middle superior temporal gyrus; pSTG, posterior superior temporal gyrus; STG, superior temporal gyrus.
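A minimal sketch of the onset-locked averaging described in panel b, not the authors’ actual pipeline: a single electrode’s activity trace is epoched around each sentence onset and averaged across sentences. The neural trace, onset times, and window length below are all simulated stand-ins.

```python
# Hedged sketch: time-locking one electrode's activity to sentence onsets
# and averaging across sentences. All values here are simulated assumptions.
import numpy as np

fs = 100                                   # neural signal sampling rate (Hz)
n_samples = 60 * fs                        # one minute of recording
rng = np.random.default_rng(0)
neural = rng.standard_normal(n_samples)    # stand-in for a high-gamma envelope

onsets_s = np.array([2.0, 9.5, 17.3, 28.0, 41.2])   # hypothetical sentence onsets (s)
window = (-0.5, 2.0)                        # seconds around each onset

start = int(window[0] * fs)
stop = int(window[1] * fs)
epochs = np.stack([neural[int(o * fs) + start : int(o * fs) + stop] for o in onsets_s])
onset_locked_mean = epochs.mean(axis=0)     # average response profile for this electrode
print(onset_locked_mean.shape)              # (250,) samples spanning -0.5 to 2.0 s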
Figure 3
Patterns of activity across the STG allow for the categorization of phonological units. (a) Single electrodes (selected electrodes are shown as colored circles labeled E1, E2, and E3) respond to incremental acoustic change, showing graded linear (E1 and E2) or abrupt nonlinear (E3) monotonic tuning to certain spectral features (e.g., F2 onset frequency or magnitude of F2 transition). Single electrode responses do not prefer a phonemic category but are tuned more generally to auditory cues such as the example acoustic-phonetic features shown in this panel. (b) Schematic depiction of the categorical neural encoding of speech sounds, derived from patterns of activity across the population. Information distributed across the electrodes (selected electrodes illustrated in subpanel i) can be used to determine the phonemic category of presented speech sounds (e.g., /ba/, /da/, /ga/) (subpanel ii) and reflects the perceptual experience of the listener. Further, overlapping functionality in the neural code (blue and purple circles in the rightmost diagram) may be important for retaining within-category sensitivities. Abbreviations: F2, second spectral peak; STG, superior temporal gyrus.
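The population-decoding idea in panel b can be illustrated with a simple linear readout: a classifier recovers phonemic category from the joint activity of many electrodes even though no single electrode is categorical. The data, electrode count, and trial structure below are simulated assumptions, not the study’s analysis.

```python
# Hedged sketch: decoding phonemic category (/ba/, /da/, /ga/) from simulated
# multi-electrode population activity with a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_trials, n_electrodes = 300, 40
labels = rng.integers(0, 3, size=n_trials)             # 0=/ba/, 1=/da/, 2=/ga/

# Each category nudges the population pattern slightly; single channels stay graded.
category_patterns = rng.standard_normal((3, n_electrodes)) * 0.5
activity = rng.standard_normal((n_trials, n_electrodes)) + category_patterns[labels]

decoder = LogisticRegression(max_iter=1000)
scores = cross_val_score(decoder, activity, labels, cv=5)
print("cross-validated decoding accuracy:", scores.mean())
```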
Figure 4
Language-dependent neural tuning supports the categorization of lexical tone. (a) Four distinct tone categories (high, rising, dipping, and falling) were included in this experiment, in which native Mandarin and English speakers were presented with naturally produced Mandarin speech (Li et al. 2021). This panel shows an example Mandarin sentence with the extracted pitch contour overlaid on a spectrogram representation of the speech sequence (color of the contour indicates corresponding tone category). (b) Single electrode responses to relative pitch height can be categorized based on the positive or negative relationship between relative pitch height (x-axis) and cortical response amplitude (y-axis). (c) Analysis of electrode pitch encoding reveals a balanced distribution of STG electrodes in native Mandarin speakers that are either negatively or positively tuned to relative pitch (−, +). In native English speakers, STG electrodes show primarily positive relative pitch tuning (+). Whereas lexical tone category can be decoded from the population-level neural response in native Mandarin speakers, the decodability of lexical tone is significantly reduced in native English speakers. These results indicate that the distribution of STG pitch tuning is biased depending on the language experience of the listener.
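To make the “relative pitch” measure and the sign of an electrode’s tuning concrete, the sketch below normalizes pitch within each speaker and checks whether a simulated electrode’s response rises or falls with relative pitch height. Z-scoring within speaker is one common normalization choice, not necessarily the exact procedure of Li et al. (2021), and all values are simulated.

```python
# Hedged sketch: speaker-normalized (relative) pitch and the sign of an
# electrode's pitch tuning. Data and the normalization scheme are assumptions.
import numpy as np

rng = np.random.default_rng(2)
speaker_ids = np.repeat([0, 1], 200)                       # two speakers
absolute_pitch = np.where(speaker_ids == 0,
                          rng.normal(220, 30, 400),        # higher-pitch speaker (Hz)
                          rng.normal(120, 20, 400))        # lower-pitch speaker (Hz)

# Relative pitch: normalize within each speaker so tuning generalizes across voices
relative_pitch = np.empty_like(absolute_pitch)
for s in np.unique(speaker_ids):
    sel = speaker_ids == s
    relative_pitch[sel] = (absolute_pitch[sel] - absolute_pitch[sel].mean()) / absolute_pitch[sel].std()

# A positively tuned electrode: response amplitude grows with relative pitch height
response = 0.8 * relative_pitch + rng.standard_normal(400) * 0.3
tuning_sign = "+" if np.corrcoef(relative_pitch, response)[0, 1] > 0 else "-"
print("electrode tuning:", tuning_sign)
```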
Figure 5
A new model of phonological analysis. (a) Classical model of auditory word recognition in which primarily serial, feedforward, hierarchical processing takes place. The first processing step is spectrotemporal analysis, through which relevant features are extracted. Spectrotemporal features are grouped into phonemic segments that are then sequentially assembled into syllables. Finally, the lexical interface maps phonological sequences onto word-level representations. In classic models of auditory word recognition, each processing step is assigned to an approximate anatomical location (the schematic to the right shows an example of these assignments). The neural representation of speech becomes increasingly higher order as it moves through successive brain areas. (b) An alternative recurrent, multi-scale, and interactive model of auditory word recognition that more closely aligns with the presented neurophysiological evidence. Acoustic signal inputs are analyzed concurrently by local processors with selectivity for acoustic phonetic features, salient temporal landmarks (e.g., peakRate), and prosodic features that occur over phonemic segments. The light gray bidirectional arrows indicate that local processors interact with one another. Recurrent connectivity indicates an integration of temporal context and sensitivity to phonological sequences by binding inputs over time during word processing. Anticipatory top-down, word-level information arises from the lexical-semantic system and the internal dynamics of ongoing phonological analysis. (c) Three local neuronal populations (circles, triangles, and crosses) on the STG encode relative (speaker-normalized) formant values, relative pitch changes, and the magnitude of peakRate events. In addition to being functionally diverse, these populations likely show distinct electrophysiological signatures (i.e., sustained versus rapid responses) (see Figure 2). The encoding of normalized spectral content (formants and pitch) suggests the presence of a context-sensitive mechanism that enables rapid retuning to speaker-specific spectral bands. Together, this set of neural responses and the responses at the previous time step define a neural state from which the appropriate word form can be decoded. Every sound segment is processed by the STG in a highly specific context that is sensitive to both temporal and phonological information. Abbreviations: MTG, middle temporal gyrus; STG, superior temporal gyrus; STS, superior temporal sulcus.
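One of the temporal landmarks named in the model, peakRate, is commonly operationalized as a local maximum in the rate of change (first derivative) of the amplitude envelope. The sketch below illustrates that operationalization on a synthetic envelope; the window size and threshold are assumptions rather than values from the review.

```python
# Hedged sketch: detecting peakRate events as peaks in the derivative of a
# synthetic amplitude envelope. Threshold and sampling rate are assumptions.
import numpy as np
from scipy.signal import find_peaks

fs = 100                                            # envelope sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
# Synthetic envelope with two syllable-like bursts
envelope = np.exp(-((t - 0.5) ** 2) / 0.01) + 0.8 * np.exp(-((t - 1.3) ** 2) / 0.02)

rate = np.gradient(envelope, 1 / fs)                # d(envelope)/dt
peaks, props = find_peaks(rate, height=0.5)         # peakRate events above a threshold
print("peakRate event times (s):", t[peaks])
print("peakRate magnitudes:", props["peak_heights"])
```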
