Effects of Spatial Speech Presentation on Listener Response Strategy for Talker-Identification

Stefan Uhrig¹ ², Andrew Perkis¹, Sebastian Möller² ³, U Peter Svensson¹, Dawn M Behne⁴

Affiliations

¹ Department of Electronic Systems, Norwegian University of Science and Technology, Trondheim, Norway.
² Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany.
³ Speech and Language Technology, German Research Center for Artificial Intelligence, Berlin, Germany.
⁴ Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway.

PMID: 35153653
PMCID: PMC8831717
DOI: 10.3389/fnins.2021.730744


Stefan Uhrig et al. Front Neurosci. 2022 Jan 28;15:730744. doi: 10.3389/fnins.2021.730744. eCollection 2021.


Abstract

This study investigates effects of spatial auditory cues on human listeners' response strategy for identifying two alternately active talkers ("turn-taking" listening scenario). Previous research has demonstrated subjective benefits of audio spatialization with regard to speech intelligibility and talker-identification effort. So far, however, the deliberate activation of specific perceptual and cognitive processes by listeners to optimize their task performance has remained largely unexamined. Spoken sentences selected as stimuli were either clean or degraded by background noise or bandpass filtering. Stimuli were presented via three horizontally positioned loudspeakers: in a non-spatial mode, both talkers were presented through a central loudspeaker; in a spatial mode, each talker was presented through the central or a talker-specific lateral loudspeaker. Participants identified talkers via speeded keypresses and afterwards provided subjective ratings (speech quality, speech intelligibility, voice similarity, talker-identification effort). In the spatial mode, presentations at lateral loudspeaker locations entailed quicker behavioral responses, which were nonetheless significantly slower than responses in a talker-localization task. For clean speech, response times globally increased in the spatial vs. non-spatial mode (across all locations); these "response time switch costs," presumably caused by repeated switching of spatial auditory attention between different locations, diminished under degraded speech. No significant effects of spatialization on subjective ratings were found. The results suggested that when listeners could utilize task-relevant auditory cues about talker location, they continued to rely on voice recognition instead of localization of talker sound sources as their primary response strategy. In addition, the presence of speech degradations may have led to increased cognitive control, which in turn compensated for the incurred response time switch costs.

Keywords: response strategy; sound localization; spatial auditory attention; spatial auditory cues; speech perception; switch costs; talker-identification; voice recognition.
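The "response time switch costs" mentioned in the abstract can be made concrete with a small computation. The Python sketch below assumes a hypothetical trial-level data layout and defines the switch cost simply as the mean correct response time in the spatial_id mode (pooled across all loudspeaker locations) minus the mean correct response time in the non-spatial_id mode, separately for each speech degradation level. The function name, field names, and toy values are illustrative assumptions; they do not reproduce the authors' data or statistical analysis.

```python
from statistics import mean


def switch_costs(trials):
    """Per-degradation response time switch costs (hypothetical sketch).

    Assumed trial layout: dict with keys 'mode' ('spatial_id',
    'non-spatial_id', or other modes such as 'spatial_loc', which are
    ignored), 'degradation' (e.g. 'clean', 'noisy', 'filtered'),
    'correct' (bool), and 'rt_ms' (float).
    """
    costs = {}
    for level in sorted({t["degradation"] for t in trials}):
        rts = {"spatial_id": [], "non-spatial_id": []}
        for t in trials:
            # Only correct responses in the two talker-identification
            # modes enter the comparison.
            if t["degradation"] == level and t["correct"] and t["mode"] in rts:
                rts[t["mode"]].append(t["rt_ms"])
        # A cost is only defined when both modes have data for this level.
        if rts["spatial_id"] and rts["non-spatial_id"]:
            costs[level] = mean(rts["spatial_id"]) - mean(rts["non-spatial_id"])
    return costs


# Made-up toy trials, for illustration only.
trials = [
    {"mode": "non-spatial_id", "degradation": "clean", "correct": True, "rt_ms": 800.0},
    {"mode": "non-spatial_id", "degradation": "clean", "correct": True, "rt_ms": 820.0},
    {"mode": "spatial_id", "degradation": "clean", "correct": True, "rt_ms": 860.0},
    {"mode": "spatial_id", "degradation": "clean", "correct": False, "rt_ms": 500.0},
    {"mode": "spatial_id", "degradation": "clean", "correct": True, "rt_ms": 880.0},
    {"mode": "spatial_loc", "degradation": "clean", "correct": True, "rt_ms": 450.0},
]
print(switch_costs(trials))  # {'clean': 60.0}
```

Pooling over locations mirrors the abstract's "across all locations" comparison; a per-location breakdown, as plotted in Figure 5, would additionally group trials by loudspeaker.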


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
Test layout deployed in the present study. The listener sits at a table, facing an array of three left (L), central (C), and right (R) loudspeakers (L = −30°, C = 0°, R = 30° azimuth) at a distance of approximately 2.15 m. The listener responds to stimuli by pressing keys on a response pad, while fixating a white cross displayed on a monitor screen below C. This figure originally appeared in Uhrig et al. (2020a), copyright 2020, with permission from IEEE.

**Figure 2**
Three continuous rating scales were employed for subjective assessment of perceived speech quality and speech intelligibility (both top scale), voice similarity (middle scale), and talker-identification effort (bottom scale). A seven-point “extended continuous scale” design was implemented in accordance with ITU-T Recommendation P.851 (2003). Another version of this figure originally appeared in Uhrig et al. (2020a), copyright 2020, with permission from IEEE.

**Figure 3**
Effects of *presentation mode* and *speech degradation* on ratings of evaluative (speech quality, speech intelligibility) and task-related [voice similarity, talker-identification (TI) effort] attributes of the overall listening experience. The numeric range of the y-axis (1–7) corresponds to the scale labels shown in Figure 2. Error bars represent 95% confidence intervals. Another version of this figure originally appeared in Uhrig et al. (2020a), copyright 2020, with permission from IEEE.

**Figure 4**
Effects of *presentation mode* and *speech degradation* on correct response time. Error bars represent 95% confidence intervals. Another version of this figure originally appeared in Uhrig (2022), copyright 2021, with permission from Springer.

**Figure 5**
Effects of *speech degradation* and *loudspeaker location* on correct response time in the spatial_id mode. Color-shaded bars represent 95% confidence ranges for the non-spatial_id mode under different *speech degradation* levels (i.e., black bar = non-spatial_id/clean, purple bar = non-spatial_id/noisy, green bar = non-spatial_id/filtered), as depicted in Figure 4. Diamond-shaped points represent the spatial_loc mode (talker-localization task). Error bars represent 95% confidence intervals. The dashed horizontal line at 700 ms marks the lower y-axis limit in Figure 4, for better comparability. Another version of this figure originally appeared in Uhrig (2022), copyright 2021, with permission from Springer.

**Figure 6**
Effects of *loudspeaker location, spatial block* (spatial block 1, spatial block 2, spatial block 3), and *lateral trial half* (first trial half, second trial half) on correct response time. Error bars represent 95% confidence intervals. Another version of this figure originally appeared in Uhrig (2022), copyright 2021, with permission from Springer.

**Figure 7**
Effects of *presentation mode* and *speech degradation* on correct response rate in the spatial_id mode. Open square-shaped points represent the non-spatial_id mode only (involving presentations only at the central loudspeaker location). Error bars represent 95% confidence intervals.


