Review
Atten Percept Psychophys. 2015 Jul;77(5):1465-87.
doi: 10.3758/s13414-015-0882-9.

The cocktail-party problem revisited: early processing and selection of multi-talker speech

Adelbert W Bronkhorst. Atten Percept Psychophys. 2015 Jul.

Abstract

How do we recognize what one person is saying when others are speaking at the same time? This review summarizes widespread research in psychoacoustics, auditory scene analysis, and attention, all dealing with early processing and selection of speech, which has been stimulated by this question. Important effects occurring at the peripheral and brainstem levels are mutual masking of sounds and "unmasking" resulting from binaural listening. Psychoacoustic models have been developed that can predict these effects accurately, albeit using computational approaches rather than approximations of neural processing. Grouping—the segregation and streaming of sounds—represents a subsequent processing stage that interacts closely with attention. Sounds can be easily grouped—and subsequently selected—using primitive features such as spatial location and fundamental frequency. More complex processing is required when lexical, syntactic, or semantic information is used. Whereas it is now clear that such processing can take place preattentively, there also is evidence that the processing depth depends on the task-relevancy of the sound. This is consistent with the presence of a feedback loop in attentional control, triggering enhancement of to-be-selected input. Despite recent progress, there are still many unresolved issues: there is a need for integrative models that are neurophysiologically plausible, for research into grouping based on other than spatial or voice-related cues, for studies explicitly addressing endogenous and exogenous attention, for an explanation of the remarkable sluggishness of attention focused on dynamically changing sounds, and for research elucidating the distinction between binaural speech perception and sound localization.

Figures

Fig. 1
Measures of “informational masking” derived from data collected by Freyman et al. (panel a) and Iyer et al. (panel b) in conditions where target speech was presented together with different types of two-talker interference. Shown are differences between speech perception scores, averaged over SNRs of −8, −4, and 0 dB, for reference and test conditions. a Scores for test conditions using F-RF presentation, from which scores for reference conditions, using FF presentation of the same sounds, were subtracted. b Data obtained by subtracting scores for test conditions with speech interference from scores for a reference condition with two modulated noise signals.
Fig. 2
Data from experiments of Best et al. (2008) and Best et al. (2010), in which one target and four interfering strings of digits were presented from different loudspeakers placed in an arc in front of the listener. a Increases in scores occurring when target digits are presented from a fixed location, instead of from “restricted” locations changing at most one loudspeaker position at a time, or from random locations. The changes could be cued with lights either at the time of change or in advance. The condition “predictable locations” employed the same sequence of locations throughout a block. b Increases in scores occurring when the string of target digits was spoken by a single voice, instead of by a different voice for each digit. The “simultaneous cue” condition in this case used random locations. All results are for an inter-digit delay of 250 ms, except those for the “predictable locations” condition, to which a correction factor was applied because they were only measured for a delay of 0 ms.
Fig. 3
Conceptual model of early speech processing. After peripheral and binaural processing, transients can already trigger attention. Primitive grouping (e.g., based on spatial location or F0) represents a subsequent stage, allowing efficient selection. More sophisticated features, such as syntactic and semantic information, are processed at a higher level and enable selection based on complex information. An important element of the model is a feedback loop, initiated by attentional control, inducing enhancement of to-be-selected input. See the text for more details.
