Auditory Sketches: Very Sparse Representations of Sounds Are Still Recognizable

Vincent Isnard et al. PLoS One. 2016 Mar 7;11(3):e0150313. doi: 10.1371/journal.pone.0150313. eCollection 2016.

Abstract

Sounds in our environment, like voices, animal calls, or musical instruments, are easily recognized by human listeners. Understanding the key features underlying this robust sound recognition is an important question in auditory science. Here, we studied how human listeners recognize new classes of sounds: acoustic and auditory sketches, sounds that are severely impoverished but still recognizable. Starting from a time-frequency representation, a sketch is obtained by keeping only sparse elements of the original signal, here by means of a simple peak-picking algorithm. Two time-frequency representations were compared: a biologically grounded one, the auditory spectrogram, which simulates peripheral auditory filtering, and a simple acoustic spectrogram, based on a Fourier transform. Three degrees of sparsity were also investigated. Listeners were asked to recognize the category to which a sketch sound belonged: singing voices, bird calls, musical instruments, and vehicle engine noises. Results showed that, with the exception of voice sounds, very sparse representations of sounds (10 features, or energy peaks, per second) could be recognized above chance. No clear differences were observed between the acoustic and the auditory sketches. For the voice sounds, however, a completely different pattern of results emerged, with at-chance or even below-chance recognition performance, suggesting that the important features of the voice, whatever they are, were removed by the sketch process. Overall, these perceptual results were well correlated with a model of auditory distances based on spectro-temporal excitation patterns (STEPs). This study confirms the potential of these new classes of sounds, acoustic and auditory sketches, for studying sound recognition.
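
As a concrete illustration of the sparsification described above, the following minimal Python sketch keeps only the strongest time-frequency elements of a signal. It is a sketch under stated assumptions, not the authors' implementation: it uses a plain STFT magnitude spectrogram (the acoustic variant; the auditory variant would replace this front end with a cochlear model) and approximates peak picking by retaining the N largest bins, with N set by the feat./s level times the signal duration.

    import numpy as np
    from scipy.signal import stft

    def sparsify(signal, fs, feats_per_sec=100):
        # Acoustic spectrogram via the short-time Fourier transform.
        f, t, Z = stft(signal, fs=fs, nperseg=512)
        mag = np.abs(Z)
        # Number of peaks to keep: feat./s level times duration in seconds.
        n_peaks = max(1, round(feats_per_sec * len(signal) / fs))
        # Keep the n_peaks largest time-frequency bins, zero the rest
        # (a simplification of true local-maximum peak picking).
        flat = np.argsort(mag, axis=None)[-n_peaks:]
        mask = np.zeros(mag.shape, dtype=bool)
        mask[np.unravel_index(flat, mag.shape)] = True
        return f, t, np.where(mask, Z, 0)

    # A 250-ms tone at 44.1 kHz keeps 25 peaks at the 100 feat./s level.
    fs = 44100
    x = np.sin(2 * np.pi * 493.88 * np.arange(int(0.25 * fs)) / fs)  # ~B4
    _, _, sparse = sparsify(x, fs, feats_per_sec=100)
    print(np.count_nonzero(sparse))  # 25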


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. The sketch process.
(A) The panel shows the first step, i.e., the time-frequency representation of a sound; here, the auditory spectrogram of the original sound (a female alto singing the vowel /a/ on a B4). (B) The panel represents the sparsification algorithm: the 25 highest peaks in the signal are selected, corresponding to the 100 feat./s sparsification level for this 250-ms excerpt. Based on this sparse representation, a sketch sound is then resynthesized. (C) The panel displays the auditory spectrogram of this sketch sound.
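
The resynthesis step, (B) to (C), can be sketched by inverting the sparse representation back to a waveform, reusing sparsify() from the snippet above. Inversion via istft is an assumption that holds only for the STFT-based acoustic sketch; the auditory spectrogram is not directly invertible, and the paper resynthesizes it with an iterative procedure instead.

    from scipy.signal import istft

    # Invert the sparse STFT back to a time-domain "sketch" waveform.
    # Valid for the acoustic sketch only; the auditory spectrogram
    # requires an iterative reconstruction (see the paper's methods).
    _, _, sparse = sparsify(x, fs, feats_per_sec=100)  # 25 peaks kept
    _, sketch = istft(sparse, fs=fs, nperseg=512)
    print(sketch.shape)  # roughly the original 11025 samples
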
Fig 2. Auditory spectrograms of original and sketch stimuli.
All panels are auditory time-frequency representations (Chi et al., 2005; see Original and sketch sounds section) of original and sketch stimuli. Left: original sounds; middle: auditory sketches (100 feat./s); right: acoustic sketches (100 feat./s). The sound examples are from the categories: (A) instruments: a harp playing a B4, (B) birds: a loon vocalization, (C) vehicles: a motorcycle, (D) voices: a female voice, singing the vowel /a/, B4.
Fig 3. Recognition performance.
(A) For each category, performance (as measured by d') is displayed at each sparsification level. With the exception of voice sounds, performance was well above chance even at the highest sparsification level, 10 feat./s. For voice sounds, performance was at chance or even lower (negative d'), meaning that participants systematically responded with any category but voice for these voice sounds. (B) Performance is displayed for auditory sketches and acoustic sketches. For bird and voice stimuli, performance was higher with auditory sketches, whereas for vehicles the reverse pattern emerged. No differences were observed for instrument sounds. Error bars correspond to the standard error of the mean.
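
For reference, d' in panel (A) is the standard signal-detection sensitivity index, d' = z(hit rate) - z(false-alarm rate); it goes negative when a category is rejected more often than chance, as happened for the voice sketches. The hit and false-alarm rates below are made-up numbers for illustration, not the paper's data.

    from scipy.stats import norm

    def d_prime(hit_rate, fa_rate):
        # Sensitivity index: z-transform of hits minus z of false alarms.
        return norm.ppf(hit_rate) - norm.ppf(fa_rate)

    print(d_prime(0.80, 0.20))  # ~1.68: recognition well above chance
    print(d_prime(0.30, 0.50))  # negative: target category systematically rejected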
Fig 4. Auditory distance model.
For each time-frequency representation (AudS: auditory spectrogram; AcS: acoustic spectrogram) and each sparsification level (10, 100, and 1000 features per second), an auditory distance dissimilarity matrix is plotted (see [33]). The mean absolute distance between STEPs [34] is represented for each sound pair of each category (Inst. for musical instruments, Birds, Veh. for vehicle engine sounds, and Voices). At the highest level of sparsity (10 feat./s), sounds are more similar to one another than at the lowest level (1000 feat./s). No obvious differences emerged between the two time-frequency representations, auditory or acoustic.
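
The distance measure in Fig 4 can be sketched as follows, assuming two spectro-temporal excitation patterns (STEPs) are available as equal-shaped time-frequency arrays; computing the STEPs themselves (see [34]) is outside this snippet, and the random arrays stand in for real patterns.

    import numpy as np

    def step_distance(step_a, step_b):
        # Mean absolute difference between two excitation patterns.
        return np.mean(np.abs(step_a - step_b))

    # Random stand-ins for two STEPs (e.g., 128 channels x 100 frames).
    rng = np.random.default_rng(0)
    a, b = rng.random((128, 100)), rng.random((128, 100))
    print(step_distance(a, b))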
Fig 5. Perceptual results (d') plotted as a function of the auditory distance values.
The filled symbols represent sketches based on the auditory spectrogram (AudS) representation; the open symbols, the acoustic spectrogram (AcS) representation. The size of the symbols corresponds to the level of sparsity: small for 10 feat./s, medium for 100 feat./s, large for 1000 feat./s. Error bars correspond to the standard error of the mean. The model and the data are well correlated.
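
The model/data comparison in Fig 5 amounts to correlating per-condition d' values with the corresponding auditory distances; a minimal sketch with placeholder numbers (not the paper's data):

    import numpy as np
    from scipy.stats import pearsonr

    d_prime_values = np.array([0.2, 0.9, 1.5, 2.1])  # hypothetical d' per condition
    distances = np.array([0.08, 0.15, 0.22, 0.30])   # hypothetical auditory distances
    r, p = pearsonr(d_prime_values, distances)
    print(f"r = {r:.2f}, p = {p:.3f}")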

References

    1. Ballas JA. Common factors in the identification of an assortment of brief everyday sounds. Journal of Experimental Psychology: Human Perception and Performance. 1993;19(2):250.
    2. Gygi B, Kidd GR, Watson CS. Spectral-temporal factors in the identification of environmental sounds. The Journal of the Acoustical Society of America. 2004;115(3):1252.
    3. Felsen G, Dan Y. A natural approach to studying vision. Nature Neuroscience. 2005;8(12):1643–6.
    4. Suied C, Viaud-Delmon I. Auditory-visual object recognition time suggests specific processing for animal sounds. PLoS One. 2009;4(4):e5256. doi: 10.1371/journal.pone.0005256.
    5. Robinson K, Patterson RD. The stimulus duration required to identify vowels, their octave, and their pitch chroma. The Journal of the Acoustical Society of America. 1995;98(4):1858–65.
