Front Psychol. 2021 Dec 23;12:769663. doi: 10.3389/fpsyg.2021.769663. eCollection 2021.

Listening in the Mix: Lead Vocals Robustly Attract Auditory Attention in Popular Music

Michel Bürgel et al.

Abstract

Listeners can attend to and track instruments or singing voices in complex musical mixtures, even though the acoustical energy of sounds from individual instruments may overlap in time and frequency. In popular music, lead vocals are often accompanied by sound mixtures from a variety of instruments, such as drums, bass, keyboards, and guitars. However, little is known about how the perceptual organization of such musical scenes is affected by selective attention, and which acoustic features play the most important role. To investigate these questions, we explored the role of auditory attention in a realistic musical scenario. We conducted three online experiments in which participants detected single cued instruments or voices in multi-track musical mixtures. Stimuli consisted of 2-s multi-track excerpts of popular music. In one condition, the target cue preceded the mixture, allowing listeners to selectively attend to the target. In the other condition, the target was presented after the mixture, requiring a more "global" mode of listening. Performance differences between these two conditions were interpreted as effects of selective attention. In Experiment 1, detection performance generally depended on the target's instrument category, and listeners were more accurate when the target was presented before the mixture rather than after it. Lead vocals were nearly unaffected by this change in presentation order and achieved the highest accuracy of all target categories, suggesting a particular salience of vocal signals in musical mixtures. In Experiment 2, filtering was used to avoid potential spectral masking of target sounds. Although detection accuracy increased for all instruments, a similar pattern of instrument-specific differences between presentation orders was observed. In Experiment 3, adjusting the sound level differences between the targets and the mixture reduced the effect of presentation order but did not affect the differences between instruments. While both acoustic manipulations facilitated the detection of targets, vocal signals remained particularly salient, which suggests that the manipulated features did not contribute to vocal salience. These findings demonstrate that lead vocals serve as robust attractor points of auditory attention regardless of the manipulation of low-level acoustical cues.

Keywords: auditory attention; music mixing; polyphonic music; singing voice; vocal salience.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Schematic overview of the experiments. (A) Procedure: the experiment started with a headphone screening task, followed by a subjective sound level calibration, a training section in which participants were familiarized with the instrument detection task, and finally the main experimental section. (B) Task: an instrument detection task was used in all experiments; participants took part in a version in which either the target preceded the mixture or the mixture preceded the target. (C) Stimulus modification: in the first experiment, unmodified excerpts were used. In the second experiment, the targets were filtered in an octave band to create a spectral region in which the target could pass without being spectrally masked. In the third experiment, the sound level differences between the individual vocals or instruments and the mixture were adjusted to one of three possible level ratios.
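The complementary filtering used in Experiment 2 can be sketched in a few lines of signal processing code. The following Python sketch is an illustration only, not the authors' implementation: the Butterworth design, filter order, sampling rate, and the pairing of a bandpass on the target with a bandstop on the mixture are assumptions made here for demonstration (the experiment tested both filter assignments; see Figure 5).

    import numpy as np
    from scipy.signal import butter, sosfilt

    def octave_band_sos(center_hz, fs, btype):
        # One-octave band around center_hz: edges at center_hz / sqrt(2)
        # and center_hz * sqrt(2). btype is "bandpass" (keep the band)
        # or "bandstop" (remove it).
        edges = [center_hz / np.sqrt(2), center_hz * np.sqrt(2)]
        return butter(4, edges, btype=btype, fs=fs, output="sos")

    def unmask_target(target, mixture, center_hz, fs=44100):
        # Band-limit the target and notch the mixture in the same octave,
        # so the target occupies a spectral region the mixture leaves free.
        filtered_target = sosfilt(octave_band_sos(center_hz, fs, "bandpass"), target)
        filtered_mixture = sosfilt(octave_band_sos(center_hz, fs, "bandstop"), mixture)
        return filtered_target, filtered_mixture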
Figure 2
Stimulus extraction. Short excerpts from a multitrack database containing reproductions of popular music were used as stimuli. The schematic shows the workflow of the stimulus construction. For details, see the text.
Figure 3
Detection accuracy in Experiment 1. Five instrument and vocal categories were used as targets (lead vocals, drums, synthesizer, piano, and bass). The square marks the mean detection accuracy for a given target category. Error bars indicate 95% CIs. Asterisks represent the average accuracy of an individual participant for the given target category. “TAR” denotes the presentation order “Target-Mixture,” in which the target cue was presented followed by the mixture. “MIX” denotes the presentation order “Mixture-Target,” in which the mixture was presented followed by the target cue.
Figure 4
Database feature analysis. For each song, we analyzed the average sound level difference between each voice or instrument and the remaining mixture, in ERB bands (A) and broadband (B). (A) Each colored line represents the average level difference at the given center frequency. The filled area represents the 95% CI for the lead vocals. (B) The circle marks the mean level difference for a given target category. Error bars indicate 95% CIs. Crosses represent the average level of an individual song for the given target category.
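A rough version of this per-band analysis can also be expressed in code. The sketch below is assumption-laden: it approximates one-ERB-wide bands with Butterworth bandpasses rather than a gammatone filterbank and uses the Glasberg and Moore (1990) ERB-rate formulas; the paper's exact analysis pipeline may differ.

    import numpy as np
    from scipy.signal import butter, sosfilt

    def erb_center_freqs(f_lo=50.0, f_hi=16000.0, step=1.0):
        # ERB-rate scale (Glasberg & Moore, 1990): center frequencies
        # spaced `step` ERBs apart between f_lo and f_hi.
        to_erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
        to_hz = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
        return to_hz(np.arange(to_erb(f_lo), to_erb(f_hi), step))

    def band_level_db(x, fc, fs):
        # RMS level (dB) in a one-ERB-wide band around fc, approximated
        # with a 2nd-order Butterworth bandpass.
        bw = 24.7 * (4.37e-3 * fc + 1.0)  # ERB bandwidth in Hz
        sos = butter(2, [fc - bw / 2.0, fc + bw / 2.0],
                     btype="bandpass", fs=fs, output="sos")
        y = sosfilt(sos, x)
        return 20.0 * np.log10(np.sqrt(np.mean(y ** 2)) + 1e-12)

    def level_difference(stem, rest, fs=44100):
        # Per-band level of one stem relative to the remaining mixture (dB).
        fcs = erb_center_freqs()
        diffs = [band_level_db(stem, fc, fs) - band_level_db(rest, fc, fs)
                 for fc in fcs]
        return fcs, np.array(diffs)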
Figure 5
Detection accuracy in Experiment 2. Three instrument and vocal categories were used as targets (lead vocals, guitar, and piano). Either a bandpass or a bandstop filter was applied to the target and the mixture. The target filter type is listed in the upper area of the figure, with TBP indicating that a bandpass was used and TBS indicating that a bandstop was used. The square marks the mean detection accuracy for a given target category. Error bars indicate 95% CIs. Asterisks represent the average accuracy of an individual participant (n = 40) for the given target category. “TAR” denotes the presentation order “Target-Mixture,” in which the target cue was presented followed by the mixture. “MIX” denotes the presentation order “Mixture-Target,” in which the mixture was presented followed by the target cue.
Figure 6
Detection accuracy in Experiment 3. Three instrument and vocal categories were used as targets (lead vocals, bass, and others = drums, guitar, piano, strings, synthesizer, and winds). The sound level ratio between the target and the mixture was adjusted to −5, −10, or −15 dB and is listed in the upper area of the figure, decreasing from right to left. The square marks the mean detection accuracy for a given target category. Error bars indicate 95% CIs. Asterisks represent data from individual participants for the given target category. “TAR” denotes the presentation order “Target-Mixture,” in which the target cue was presented followed by the mixture. “MIX” denotes the presentation order “Mixture-Target,” in which the mixture was presented followed by the target cue. The green cross above the lead vocals in the −15 dB condition marks the average detection accuracy when all stimuli that were consistently answered incorrectly were excluded (for details, see Results).
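Setting a fixed target-to-mixture level ratio, as in Experiment 3, amounts to rescaling the target before summing it back into the mix. The sketch below assumes a broadband RMS level measure; the paper may have used a different level or loudness metric, so this only illustrates the idea.

    import numpy as np

    def rms_db(x):
        # Broadband RMS level in dB (the epsilon avoids log of zero).
        return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))) + 1e-12)

    def set_level_ratio(target, rest, ratio_db):
        # Scale the target so its RMS level sits ratio_db relative to the
        # remaining mixture (e.g., ratio_db = -10 puts the target 10 dB
        # below the rest), then sum the two to form the adjusted mix.
        gain_db = ratio_db - (rms_db(target) - rms_db(rest))
        return target * 10.0 ** (gain_db / 20.0) + rest

Under this convention, a call like set_level_ratio(vocals, accompaniment, -15.0) would correspond to the hardest condition in the figure, with the target 15 dB below the remaining mix.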
