Review

Neuron. 2019 Jun 19;102(6):1096-1110. doi: 10.1016/j.neuron.2019.04.023.

The Encoding of Speech Sounds in the Superior Temporal Gyrus

Han Gyol Yi et al.
Abstract

The human superior temporal gyrus (STG) is critical for extracting meaningful linguistic features from speech input. Local neural populations are tuned to acoustic-phonetic features of all consonants and vowels and to dynamic cues for intonational pitch. These populations are embedded throughout broader functional zones that are sensitive to amplitude-based temporal cues. Beyond speech features, STG representations are strongly modulated by learned knowledge and perceptual goals. Currently, a major challenge is to understand how these features are integrated across space and time in the brain during natural speech comprehension. We present a theory that temporally recurrent connections within STG generate context-dependent phonological representations, spanning longer temporal sequences relevant for coherent percepts of syllables, words, and phrases.

Keywords: acoustic-phonetic features; auditory cortex; context-dependent representation; electrocorticography; phonological sequence; speech processing; superior temporal gyrus; temporal integration; temporal landmarks; temporally recurrent connections.

Figures

Figure 1. Speech sounds can be described in multiple complementary ways.
For example, the English words pin, fin, and fun can be characterized according to several different but related descriptions, ranging from physical acoustic features to abstract linguistic features. (A) The acoustic waveforms of these words show a broad distinction between low-amplitude, aperiodic features (consonants) and high-amplitude, strongly periodic features (vowels). (B) Spectrogram representations of these words show how each sound is characterized by a different spectrotemporal pattern of acoustic energy. (C) Articulatory descriptions of these sounds characterize acoustic-phonetic features. Plosives are produced by initially blocking the airflow (gray), then releasing air through the mouth (black), generating a short broadband burst in the spectrogram. Fricatives are produced by partially occluding the passage of air in the mouth, generating a longer-duration, high-frequency broadband noise in the spectrogram. These two features are examples of obstruents. (D) High-front vowels are produced by moving the tongue to the top and front of the mouth, creating a resonance cavity that generates relatively low first-formant and high second-formant values. In contrast, low-back vowels show the reverse pattern. These two features are examples of sonorants. (E) Each of the example words can also be characterized as a sequence of abstract phonemes: /pɪn/, /fɪn/, and /fʌn/. (F) Multiple features are combined to describe unique phonemes. Here, the obstruent, plosive, unvoiced, and bilabial features are combined to describe the English phoneme /p/. Changing the plosive feature to fricative and the bilabial feature to labio-dental describes the phoneme /f/ (not all possible features are shown, for simplicity).
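To make the feature-bundle description in panel (F) concrete, here is a minimal Python sketch; the phoneme inventory and feature names are abbreviated for illustration and do not reproduce the full feature set discussed in the review.

```python
# Minimal sketch of panel (F): phonemes as bundles of articulatory features.
# The feature inventory here is abbreviated; it is not the full set.
PHONEME_FEATURES = {
    "p": {"obstruent", "plosive", "unvoiced", "bilabial"},
    "f": {"obstruent", "fricative", "unvoiced", "labio-dental"},
    "b": {"obstruent", "plosive", "voiced", "bilabial"},
}

def feature_diff(a: str, b: str) -> set:
    """Return the features that distinguish phoneme a from phoneme b."""
    return PHONEME_FEATURES[a] ^ PHONEME_FEATURES[b]

# /p/ and /f/ differ only in manner (plosive vs. fricative) and
# place (bilabial vs. labio-dental), exactly as the caption describes.
print(feature_diff("p", "f"))
# {'plosive', 'fricative', 'bilabial', 'labio-dental'} (set order may vary)
```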
Figure 2. Local encoding of acoustic-phonetic features in human superior temporal gyrus (STG).
Using direct electrocorticography (ECoG), neural responses to speech can be measured with concurrently high spatial and temporal resolution. These data reveal the encoding of acoustic-phonetic features in local populations during speech perception. (A) ECoG electrodes over human STG (outlined in black) show robust evoked responses to distinct sounds while participants listen to (B) naturally spoken sentences. (C) Each electrode shows selective responses to groups of phonemes, corresponding to acoustic-phonetic features. (D) Electrodes sensitive to specific acoustic-phonetic features (e.g., fricatives or low-back vowels) have spectrotemporal receptive fields that strongly resemble the average acoustic spectrograms of the sounds characterized by those features (adapted from Mesgarani et al., 2014).
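Spectrotemporal receptive fields of the kind shown in panel (D) are commonly estimated as a regularized linear mapping from a time-lagged stimulus spectrogram to each electrode's response. The numpy sketch below uses synthetic data, arbitrary array sizes, and an arbitrary ridge penalty purely to show the shape of that computation; it is not the fitting procedure of Mesgarani et al. (2014).

```python
import numpy as np

# Sketch of STRF estimation: ridge regression from a time-lagged
# spectrogram onto one electrode's response. All data are synthetic;
# real analyses use recorded high-gamma activity.
rng = np.random.default_rng(0)
n_t, n_freq, n_lags = 2000, 32, 20   # time bins, frequency bands, history bins

spec = rng.standard_normal((n_t, n_freq))          # stimulus spectrogram
true_strf = rng.standard_normal((n_lags, n_freq))  # hypothetical ground truth

# Design matrix: each row stacks the previous n_lags spectrogram frames.
X = np.zeros((n_t, n_lags * n_freq))
for lag in range(n_lags):
    X[lag:, lag * n_freq:(lag + 1) * n_freq] = spec[:n_t - lag]

y = X @ true_strf.ravel() + 0.5 * rng.standard_normal(n_t)  # synthetic response

# Ridge solution: w = (X'X + lambda*I)^-1 X'y
lam = 10.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
strf = w.reshape(n_lags, n_freq)   # lag-by-frequency filter, as in panel (D)
```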
Figure 3. STG is parcellated into two major zones that track temporal landmarks relevant for speech processing.
Broad regions encoding temporal landmarks have acoustic-phonetic feature detectors embedded in them, facilitating temporal context-dependent speech representations. (A) Speech can be characterized by multiple temporal/linguistic scales ranging from features to syllables to words to phrases. (B) Onsets from silence cue prosodic phrase boundaries. (C) Amplitude envelope change dynamics are a major source of acoustic variability, and peaks in the rate of change correspond to syllabic nuclei. (D) STG is characterized by a global spatial organization for temporal landmarks. Posterior STG tracks onsets following a period of silence that is 200 ms or longer, while middle-to-anterior STG has more sustained responses that may track peaks in the rate of amplitude envelope change. Neural populations in both regions are tuned to acoustic-phonetic features, suggesting that STG integrates temporal landmarks and instantaneous phonetic units.
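Both landmark types in panels (B)-(D) can be approximated from the amplitude envelope alone. The sketch below assumes a 16 kHz waveform and uses illustrative thresholds: it flags onsets preceded by at least 200 ms of near-silence, and peaks in the envelope's rate of change.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks

def temporal_landmarks(wav: np.ndarray, fs: int = 16_000,
                       silence_thresh: float = 0.01,
                       min_silence_s: float = 0.2):
    """Sketch of the two landmark types in Figure 3 (thresholds illustrative)."""
    # Amplitude envelope; in practice this is usually low-pass filtered first.
    env = np.abs(hilbert(wav))

    # Onsets from silence: envelope crosses threshold after >= 200 ms below it.
    quiet = env < silence_thresh
    min_run = int(min_silence_s * fs)
    onsets = [t for t in range(min_run, len(env))
              if not quiet[t] and quiet[t - min_run:t].all()]

    # Peaks in the rate of envelope change, a proxy for syllabic nuclei.
    d_env = np.gradient(env) * fs                  # d(envelope)/dt per second
    peaks, _ = find_peaks(d_env, height=d_env.std())
    return np.array(onsets), peaks
```

Applied to natural sentences, the onset list would approximate prosodic phrase boundaries (panel B), and the envelope-rate peaks would approximate syllabic nuclei (panel C).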
Figure 4. STG combines acoustic-phonetic tuning with various sources of context to compute perceptual representations of speech.
(A) Acoustic spectrograms of example words that differ in a single phoneme/acoustic-phonetic feature (/s/ vs. /k/), along with a stimulus in which masking noise (/#/) completely replaces the middle sound. (B) Stimulus encoding involves detecting acoustic-phonetic features with tuned neural populations (e.g., fricative populations respond to /s/, and plosive populations respond to /k/). This response is embedded in both local and distributed representations of context (orange texture), including sensitivity to language-level sequence statistics (phonotactics), lexical statistics such as word frequency, and attention to particular speakers. In the case of the ambiguous sound, STG neural populations "restore" the missing phoneme by activating the appropriate feature-tuned population in real time, possibly using a combination of these multiple sources of context. (C) The output of STG population activity reflects the perceptual experience of the listener. Specifically, STG activity encodes the percept of the phonological sequence, in this case the whole word "faster" or "factor". In the case of ambiguous input (A, bottom), these percepts do not directly correspond to the input acoustic signal.
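One way to read the "restoration" in panel (B) is as probabilistic cue combination: with the acoustic likelihood flattened by masking noise, contextual priors such as lexical frequency dominate the percept. The toy numbers below are invented for illustration, and the review itself does not commit to this particular Bayesian formulation.

```python
import numpy as np

# Sketch of phoneme restoration as Bayesian cue combination.
# All probabilities below are invented for illustration.
candidates = ["s", "k"]                 # "fa_ter" -> "faster" or "factor"

# Acoustic likelihood: masking noise makes the evidence uninformative.
p_acoustic = np.array([0.5, 0.5])

# Contextual prior from lexical statistics (e.g., relative frequency of
# "faster" vs. "factor"); phonotactics and speaker attention could be
# folded in the same way.
p_context = np.array([0.7, 0.3])

posterior = p_acoustic * p_context
posterior /= posterior.sum()
print(dict(zip(candidates, posterior.round(2))))   # {'s': 0.7, 'k': 0.3}
```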
Figure 5. Computational implementations of temporal sequencing and binding in speech cortex.
(A) How does the brain bind instantaneous acoustic-phonetic features (e.g., /ʃ/, /ɑ/, and /p/) into perceptually coherent sequences (e.g., "shop")? (B) In one framework, a dedicated temporal integrator receives feature representations from STG. (C) Distinct STG populations (recorded with different electrodes: e1, e2, etc.) detect acoustic-phonetic features from the acoustic input (D) by generating spatially and temporally independent neural responses. (E) Detected features are passed to a separate mechanism that tracks temporal order and is capable of temporal integration. (F) The temporal integrator/sequencer has a relatively long temporal window and is thus able to bind multiple feature inputs across time. (G) The resulting sequence representation contains explicit markers of temporal order (e.g., /ʃ/₁, /ɑ/₂, and /p/₃). (H-J) An alternative framework has context-dependent acoustic-phonetic feature representations that arise from temporally recurrent connections. (H) The laminar organization of human cortex provides a means for input and output connections across layers and columns to implement temporal recurrence, where input to layer IV is contextually modulated by prior output from supragranular layers and by thalamic inputs. (I) Unfolded across time, the neural representation of the input is a function of the past state of the network via temporally recurrent connections among feature detectors. (J) At the population level, the representation across time of the sequence "shop" (/ʃɑp/) is distinguishable from that of "ship" (/ʃɪp/) not only on the basis of the instantaneous responses to the vowels (/ɑ/ vs. /ɪ/) but also on the basis of the context-modulated responses to the final consonant (/p/ preceded by /ɑ/ vs. /p/ preceded by /ɪ/; i.e., /p/ does not occupy a single point in the state space).
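The state-space claim in panels (H)-(J) can be demonstrated with a tiny Elman-style recurrent network; even random, untrained weights suffice. The one-hot phoneme encoding and network sizes below are invented for illustration.

```python
import numpy as np

# Sketch of panels (H)-(J): context-dependent representations from recurrence.
# Random untrained weights are enough to show the state-space effect.
rng = np.random.default_rng(1)
phonemes = ["ʃ", "ɑ", "ɪ", "p"]
onehot = {ph: np.eye(len(phonemes))[i] for i, ph in enumerate(phonemes)}

n_hidden = 16
W_in = rng.standard_normal((n_hidden, len(phonemes)))
W_rec = rng.standard_normal((n_hidden, n_hidden)) / np.sqrt(n_hidden)

def run(seq):
    """Elman-style recurrence: h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    h = np.zeros(n_hidden)
    for ph in seq:
        h = np.tanh(W_in @ onehot[ph] + W_rec @ h)
    return h

h_shop = run(["ʃ", "ɑ", "p"])   # population state after /p/ in "shop"
h_ship = run(["ʃ", "ɪ", "p"])   # population state after /p/ in "ship"

# The same input /p/ yields different population states in context:
print(np.linalg.norm(h_shop - h_ship))   # > 0: /p/ is not a single point
```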
