How the human brain recognizes speech in the context of changing speakers

Katharina von Kriegstein et al. J Neurosci. 2010 Jan 13;30(2):629-638. doi: 10.1523/JNEUROSCI.2742-09.2010.

Abstract

We understand speech from different speakers with ease, whereas artificial speech recognition systems struggle with this task. It is unclear how the human brain solves this problem. The conventional view is that speech message recognition and speaker identification are two separate functions, with message processing located predominantly in the left hemisphere and processing of speaker-specific information in the right hemisphere. Here, we distinguish the contributions of specific cortical regions to speech recognition and speaker-information processing by controlled manipulation of task and resynthesized speaker parameters. Two functional magnetic resonance imaging studies provide evidence for a dynamic speech-processing network that questions the conventional view. We found that speech recognition regions in left posterior superior temporal gyrus/superior temporal sulcus (STG/STS) encode not only the speech message but also speaker-related vocal tract parameters, which are reflected in the amplitude peaks of the speech spectrum. Right posterior STG/STS responded specifically more strongly to a speaker-related vocal tract parameter change during a speech recognition task than during a voice recognition task. Left and right posterior STG/STS were functionally connected. Additionally, we found that speaker-related glottal fold parameters (e.g., pitch), which are not reflected in the amplitude peaks of the speech spectrum, are processed in areas immediately adjacent to primary auditory cortex, i.e., earlier in the auditory hierarchy than STG/STS. Our results point to a network account of speech recognition in which information about the speech message and the speaker's vocal tract is combined to solve the difficult task of understanding speech from different speakers.


Figures

Figure 1.
The contribution of glottal fold and vocal tract parameters to the speech output. A, Shown is a sagittal section through a human head and neck. Green circle, Glottal folds; blue lines, extension of the vocal tract from glottal folds to tip of the nose and lips. B, The three plots show three different sounds determined by glottal fold parameters. In voiced speech, the vibration of the glottal folds results in lower voices (120 Hz GPR; top) or higher voices (200 Hz GPR; middle). If glottal folds are constricted, they produce a noise-like sound that is heard as whispered speech (0 Hz GPR; bottom). C, The vocal tract filters the sound wave coming from the glottal folds, which introduces amplitude peaks at certain frequencies (“formants”; blue lines). Note that the different glottal fold parameters do not influence the formant position. D, Both speech- and speaker-related vocal tract parameters influence the position of the formants. Here we show as an example the formant shifts associated with the speech sounds /u/ and /a/ (first and second plot) and an /a/ with a shorter and longer vocal tract length (second and third plot).
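The source-filter relationship described in this caption (a glottal source setting the pitch, or noise for whisper, then vocal tract resonances introducing formant peaks) can be sketched in code. This is a minimal illustrative synthesis, not the stimulus-generation procedure used in the study; the sampling rate, formant frequencies, and bandwidth are assumed example values:

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sampling rate in Hz (assumed for illustration)

def glottal_source(gpr_hz, dur=0.5):
    """Glottal source: an impulse train at the glottal pulse rate (GPR)
    for voiced speech, or white noise when GPR is 0 (whispered speech)."""
    n = int(FS * dur)
    if gpr_hz == 0:
        return np.random.default_rng(0).standard_normal(n)  # whisper: noise
    src = np.zeros(n)
    period = int(FS / gpr_hz)
    src[::period] = 1.0  # one glottal pulse per period
    return src

def vocal_tract_filter(source, formants_hz, bw_hz=80.0):
    """Vocal tract as a cascade of second-order resonators, one per formant,
    introducing amplitude peaks at the formant frequencies."""
    out = source
    for f in formants_hz:
        r = np.exp(-np.pi * bw_hz / FS)           # pole radius from bandwidth
        theta = 2 * np.pi * f / FS                # pole angle from frequency
        a = [1.0, -2 * r * np.cos(theta), r * r]  # resonator denominator
        out = lfilter([1.0 - r], a, out)          # rough gain normalization
    return out

# Illustrative /a/-like formants: a shorter vocal tract scales all
# formant frequencies upward, as in panel D (values are assumptions).
a_long  = vocal_tract_filter(glottal_source(120), [700, 1200, 2600])
a_short = vocal_tract_filter(glottal_source(120), [840, 1440, 3120])
whisper = vocal_tract_filter(glottal_source(0),   [700, 1200, 2600])
```

Note how the same formant filter applies unchanged to the 120 Hz, 200 Hz, and whispered sources, mirroring the caption's point that glottal fold parameters do not influence formant position.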
Figure 2.
BOLD responses associated with the main effect of VTL (red) and main effect of task (green) as revealed by the conjunction analysis of experiment 1 and experiment 2. The group mean structural image is overlaid with the statistical parametric maps for the respective contrasts. “Control task” refers to loudness task in experiment 1 and to speaker task in experiment 2. L, Left hemisphere; VTL, acoustic effect of vocal tract length. The dotted lines on the sagittal section indicate the slices displayed as horizontal and coronal sections. The plots show the parameter estimates for experiments 1 and 2 separately. The small bar graphs on top of the plots display the main effects and their significance threshold in a repeated-measures ANOVA. Results of post hoc t tests are indicated by the brackets within the plot. *p < 0.05, ***p < 0.001. ns, Nonsignificant. Error bars represent ±1 SEM.
Figure 3.
BOLD responses associated with the interaction between task and VTL. The contrast for experiment 1 is rendered in magenta and for experiment 2 in cyan. The plots show the parameter estimates for experiments 1 and 2 separately [MNI coordinates: experiment 1, (52, −22, 0); experiment 2, (68, −42, 16)]. The small bar graphs on top of the plots show the significant interaction and main effects and their significance threshold in a repeated-measures ANOVA. Results of post hoc t test are indicated by the brackets within the plot. *p < 0.05. ns, Nonsignificant. Error bars represent ±1 SEM.
Figure 4.
Overview of BOLD responses in right and left hemisphere. This figure also includes the BOLD responses reported in a previous study (von Kriegstein et al., 2007). The right-sided activation for the previous study is shown at a threshold of p < 0.003 for display purposes. The voxel with the maximum statistic for this study is at (60, −42, −2), Z = 3.12.
Figure 5.
Functional connectivity (PPI) between left and right posterior STG/STS. Seed regions were taken from individual subject clusters; here the group mean is shown (red). Target regions identified by the PPI analysis (VTL × task, connectivity) are shown in green [MNI coordinates: experiment 1, (58, −46, 20), Z = 3.03; experiment 2, (60, −52, 20), Z = 3.26]. BOLD responses associated with the interaction between task and VTL (VTL × task, activity) are displayed to demonstrate their consistently close proximity to PPI target regions in right posterior STG/STS.
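Loosely, a PPI analysis like the one in this figure asks whether coupling between a seed region and a target region changes with the psychological context (here, VTL × task). A simplified sketch of building and fitting such an interaction regressor follows; it omits the hemodynamic deconvolution step used in standard SPM-style PPI, so it is a conceptual illustration rather than the authors' pipeline:

```python
import numpy as np

def ppi_design(seed_ts, context):
    """Toy PPI design matrix with columns [psychological, physiological,
    interaction]. `seed_ts` is the seed-region time course; `context` is
    the condition vector (e.g., VTL x task), one value per scan. Standard
    PPI deconvolves seed BOLD to the neural level before multiplying;
    that step is omitted here for brevity."""
    psy = context - context.mean()    # mean-centered psychological term
    phys = seed_ts - seed_ts.mean()   # physiological term (seed signal)
    ppi = psy * phys                  # the interaction regressor
    return np.column_stack([psy, phys, ppi])

def ppi_effect(target_ts, design):
    """Least-squares fit of the target time course on the PPI design
    (plus an intercept); returns the interaction beta."""
    X = np.column_stack([np.ones(len(target_ts)), design])
    beta, *_ = np.linalg.lstsq(X, target_ts, rcond=None)
    return beta[3]  # coefficient of the interaction regressor
```

A positive interaction beta in the target region would indicate stronger seed-target coupling in one context than the other, which is the pattern the figure reports between left and right posterior STG/STS.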
Figure 6.
BOLD responses for voiced and whispered speech. The group mean structural image is overlaid with the statistical parametric maps for the contrasts between (1) voiced > whispered speech (red), (2) whispered > voiced speech (yellow), and (3) pitch varies > VTL varies (cyan). The plot shows parameter estimates for voiced and whispered speech in Te1.2 and Te1.1 (volume of interest). Error bars represent ±1 SEM. A repeated-measures ANOVA with the factors location (Te1.1, Te1.2) and sound quality (voiced, whispered) revealed a significant interaction of sound quality × location (F(1,17) = 28, p < 0.0001), indicating differential responsiveness to whispered sounds in Te1.1 and to voiced sounds in Te1.2. ***p < 0.001.
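For a 2 × 2 within-subject design like the one in this caption (location: Te1.1/Te1.2 × sound quality: voiced/whispered), the interaction F with (1, n − 1) degrees of freedom is equivalent to the squared one-sample t test on each subject's double-difference score. A small sketch of that equivalence (illustrative only, not the authors' analysis code):

```python
import numpy as np

def rm_interaction_F(y11, y12, y21, y22):
    """Interaction F for a 2x2 repeated-measures design. Each argument is
    one condition's values, one per subject. For within-subject designs
    the interaction F(1, n-1) equals the squared one-sample t on the
    per-subject double difference (y11 - y12) - (y21 - y22)."""
    d = (np.asarray(y11) - np.asarray(y12)) - (np.asarray(y21) - np.asarray(y22))
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))  # one-sample t on d
    return t ** 2, (1, n - 1)
```

With the study's 18 subjects this yields the (1, 17) degrees of freedom reported for the sound quality × location interaction.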
