J Neurosci. 2025 Aug 6;45(32):e1751242025.
doi: 10.1523/JNEUROSCI.1751-24.2025.

Interference of Mid-level Speech and Noise Statistics Underlies Human Speech Recognition Sensitivity in Natural Environmental Noise

Alex C Clonan et al. J Neurosci. 2025.

Abstract

Recognizing speech in noise, such as in a busy restaurant, is an essential cognitive skill where the task difficulty varies across environments and noise levels. Although there is growing evidence that the auditory system relies on statistical representations for perceiving and coding natural sounds, it is less clear how statistical cues and neural representations contribute to segregating speech in natural auditory scenes. Here we demonstrate that male and female human listeners rely on mid-level statistics to segregate and recognize speech in environmental noise. Using natural backgrounds and variants with perturbed spectrotemporal statistics, we show that speech recognition accuracy at a fixed noise level varies extensively across natural backgrounds (0-100%). Furthermore, for each background the unique interference created by summary statistics can mask or unmask speech, thus hindering or improving speech recognition. To identify the neural coding strategy and statistical cues that influence accuracy, we developed generalized perceptual regression, a framework that links summary statistics from a neural model to word recognition accuracy. Whereas summary statistics from a peripheral cochlear model account for only 60% of perceptual variance, summary statistics from a mid-level auditory midbrain model accurately predict single-trial sensory judgments, accounting for >90% of the perceptual variance. Furthermore, perceptual weights from the regression framework identify which statistics and tuned neural filters are influential and how they impact recognition. Thus, perception of speech in natural backgrounds relies on a mid-level auditory representation involving interference of multiple summary statistics that impact recognition beneficially or detrimentally across natural background sounds.
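The abstract does not spell out how generalized perceptual regression is implemented. As a rough illustration only, one way to link per-trial summary statistics to correct/incorrect word judgments is a logistic regression whose fitted weights play the role of perceptual weights. The sketch below assumes that form. The synthetic data, the number of statistics, the function names `fit_logistic` and `predict_proba`, and the fitting settings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def predict_proba(X, w, b):
    """Predicted probability of a correct response for each trial."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def fit_logistic(X, y, lr=0.5, n_iter=2000, l2=1e-3):
    """Fit a logistic regression by batch gradient descent.

    X: (n_trials, n_stats) summary statistics per trial (hypothetical inputs).
    y: (n_trials,) 1.0 if the word was recognized correctly, else 0.0.
    Returns the weight vector w and bias b.
    """
    n_trials, n_stats = X.shape
    w = np.zeros(n_stats)
    b = 0.0
    for _ in range(n_iter):
        p = predict_proba(X, w, b)
        # Gradient of the mean negative log-likelihood plus a small L2 penalty.
        grad_w = X.T @ (p - y) / n_trials + l2 * w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic stand-in data: NOT the study's stimuli or model outputs.
rng = np.random.default_rng(0)
n_trials, n_stats = 2000, 4
X = rng.normal(size=(n_trials, n_stats))

# Assumed ground-truth weights: one helpful, one harmful, one irrelevant, one mild.
true_w = np.array([1.5, -2.0, 0.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(n_trials) < p_true).astype(float)

w, b = fit_logistic(X, y)
accuracy = np.mean((predict_proba(X, w, b) > 0.5) == (y > 0.5))

print("fitted weights:", np.round(w, 2))
print("training accuracy:", round(float(accuracy), 3))
```

In this toy setup, a positive fitted weight raises the predicted probability of correct recognition and a negative weight lowers it, which loosely mirrors the abstract's point that individual statistics can affect recognition beneficially or detrimentally.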

Keywords: auditory midbrain; cocktail party problem; natural sounds; neural network; sound statistics; speech in noise; speech recognition.
