PLoS Comput Biol. 2012;8(7):e1002594.
doi: 10.1371/journal.pcbi.1002594. Epub 2012 Jul 12.

Sparse codes for speech predict spectrotemporal receptive fields in the inferior colliculus


Nicole L Carlson et al. PLoS Comput Biol. 2012.

Abstract

We have developed a sparse mathematical representation of speech that minimizes the number of active model neurons needed to represent typical speech sounds. The model learns several well-known acoustic features of speech such as harmonic stacks, formants, onsets and terminations, but we also find more exotic structures in the spectrogram representation of sound such as localized checkerboard patterns and frequency-modulated excitatory subregions flanked by suppressive sidebands. Moreover, several of these novel features resemble neuronal receptive fields reported in the Inferior Colliculus (IC), as well as auditory thalamus and cortex, and our model neurons exhibit the same tradeoff in spectrotemporal resolution as has been observed in IC. To our knowledge, this is the first demonstration that receptive fields of neurons in the ascending mammalian auditory pathway beyond the auditory nerve can be predicted based on coding principles and the statistical properties of recorded sounds.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic illustration of our sparse coding model.
(a) Stimuli used to train the model consisted of examples of recorded speech. The blue curve represents the raw sound pressure waveform of a woman saying, “The north wind and the sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.” (b) The raw waveforms were first put through one of two preprocessing steps meant to model the earliest stages of auditory processing to produce either a spectrogram or a “cochleogram” (not shown; see Methods for details). In either case, the power spectrum across acoustic frequencies is displayed as a function of time, with warmer colors indicating high power content and cooler colors indicating low power. (c) The spectrograms were then divided into overlapping 216 ms segments. (d) Subsequently, principal components analysis (PCA) was used to project each segment onto the space of the first two hundred principal components (first ten shown), in order to reduce the dimensionality of the data to make it tractable for further analysis while retaining its basic structure. (e) These projections were then input to a sparse coding network in order to learn a “dictionary” of basis elements analogous to neuronal receptive fields, which can then be used to form a representation of any given stimulus (i.e., to perform inference). We explored networks capable of learning either “hard” (L0) sparse dictionaries or “soft” (L1) sparse dictionaries (described in the text and Methods) that were undercomplete (fewer dictionary elements than PCA components), complete (equal number of dictionary elements), or over-complete (greater number of dictionary elements).
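The segmentation step in panel c and the dictionary-size terminology in panel e can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the hop size, and the toy spectrogram are assumptions made for illustration, and the PCA and dictionary-learning steps are omitted. The two penalty functions show only the standard definitions of the L0 and L1 measures of a coefficient vector, not the paper's full objective.

```python
def segment_spectrogram(spectrogram, frames_per_segment, hop):
    """Split a (frequency x time) spectrogram into overlapping time segments.

    `frames_per_segment` and `hop` are hypothetical parameters; segments
    overlap whenever hop < frames_per_segment.
    """
    n_frames = len(spectrogram[0])
    segments = []
    for start in range(0, n_frames - frames_per_segment + 1, hop):
        segments.append([row[start:start + frames_per_segment] for row in spectrogram])
    return segments


def completeness(n_elements, n_components):
    """Ratio of dictionary elements to PCA components, with its label."""
    ratio = n_elements / n_components
    if ratio < 1:
        label = "undercomplete"
    elif ratio == 1:
        label = "complete"
    else:
        label = "overcomplete"
    return ratio, label


def l0_penalty(coefficients):
    """'Hard' sparseness measure: the number of active (nonzero) coefficients."""
    return sum(1 for c in coefficients if c != 0)


def l1_penalty(coefficients):
    """'Soft' sparseness measure: the sum of absolute coefficient values."""
    return sum(abs(c) for c in coefficients)


# Toy spectrogram: 2 frequency channels x 10 time frames (made-up values).
toy = [[float(t) for t in range(10)], [float(10 + t) for t in range(10)]]
segments = segment_spectrogram(toy, frames_per_segment=4, hop=2)
print(len(segments))            # 4 segments, starting at frames 0, 2, 4, 6
print(completeness(100, 200))   # (0.5, 'undercomplete')
```

With 200 PCA components, a half-complete dictionary has 100 elements, which matches the 100-element dictionaries described in the captions below.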
Figure 2. A half-complete sparse coding dictionary trained on cochleogram representations of speech.
This dictionary exhibits a limited range of shapes. The full set of 100 elements from a half-complete, L0-sparse dictionary trained on cochleograms of human speech resembles those found in a previous study. Nearly all elements are extremely smooth, with most consisting of a single frequency subfield or an unmodulated harmonic stack. Each rectangle can be thought of as representing the spectro-temporal receptive field (STRF) of a single element in the dictionary (see Methods for details); time is plotted along the horizontal axis (from 0 to 250 ms), and log frequency is plotted along the vertical axis, with frequencies ranging from 73 Hz to 7630 Hz. Color indicates the amount of power present at each frequency at each moment in time, with warm colors representing high power and cool colors representing low power. Each element has been normalized to have unit Euclidean length. Elements are arranged in order of their usage during inference (i.e., when used to represent individual sounds drawn from the training set) with usage increasing from left to right along each row, and all elements of lower rows used more than those of higher rows.
Figure 3. A half-complete, L0-sparse dictionary trained on spectrograms of speech.
This dictionary exhibits a variety of distinct shapes that capture several classes of acoustic features present in speech and other natural sounds. (a–f) Selected elements from the dictionary that are representative of different types of receptive fields: (a) a harmonic stack; (b) an onset element; (c) a harmonic stack with flanking suppression; (d) a more localized onset/termination element; (e) a formant; (f) a tight checkerboard pattern (see Fig. S1 for the full dictionary). Each rectangle represents the spectro-temporal receptive field (STRF) of a single element in the dictionary; time is plotted along the horizontal axis (from 0 to 216 msec) and log frequency is plotted along the vertical axis, with frequencies ranging from 100 Hz to 4000 Hz. (g) A graph of the usage of the dictionary elements showing that the different types of receptive field shapes separate based on usage into a series of rises and plateaus; red symbols indicate where each of the examples from panels a–f fall on the graph. The vertical axis represents the number of stimuli that required a given dictionary element in order to be represented accurately during inference.
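The usage measure in panel g (the number of stimuli that required a given dictionary element) can be sketched as follows. This is an illustrative reading of the caption rather than the authors' implementation: the coefficient matrix, the `tol` threshold, and the function names are hypothetical, and an element is counted as "required" whenever its inferred coefficient is nonzero.

```python
def usage_counts(coefficients, tol=0.0):
    """Count, for each dictionary element, the stimuli that use it.

    `coefficients` is a list of per-stimulus coefficient vectors
    (stimuli x elements). An element counts as used by a stimulus
    when the magnitude of its coefficient exceeds `tol` (an assumed threshold).
    """
    n_elements = len(coefficients[0])
    counts = [0] * n_elements
    for coeffs in coefficients:
        for j, c in enumerate(coeffs):
            if abs(c) > tol:
                counts[j] += 1
    return counts


def order_by_usage(counts):
    """Element indices sorted from least used to most used."""
    return sorted(range(len(counts)), key=lambda j: counts[j])


# Made-up coefficients for 3 stimuli and 3 dictionary elements.
coeffs = [
    [0.0, 1.2, 0.0],
    [0.5, -0.3, 0.0],
    [0.0, 2.0, 0.0],
]
counts = usage_counts(coeffs)
print(counts)                  # [1, 3, 0]
print(order_by_usage(counts))  # [2, 0, 1]
```

Plotting the sorted counts against element rank would give a usage curve of the kind described in panel g.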
Figure 4. A four-times overcomplete, L0-sparse dictionary trained on speech spectrograms.
This dictionary shows a greater diversity of shapes than the undercomplete dictionaries. (a–l) Representative elements a, c, e, g, j, and l resemble those of the half-complete dictionary (see Fig. 3). Other neurons display more complex shapes than those found in less overcomplete dictionaries: (b) a harmonic stack with flanking suppressive subregions; (d) a neuron sensitive to lower frequencies; (f) a short harmonic stack; (h) a localized but complex pattern of excitation with flanking suppression; (i) a localized checkerboard with larger excitatory and suppressive subregions than those in panel l; (k) a checkerboard pattern that extends for many cycles in time. Several of these patterns resemble neural spectro-temporal receptive fields (STRFs) reported in various stages of the auditory pathway that have not been predicted by previous theoretical models (see text and Figs. 6–8). (m) A graph of usage of the dictionary elements during inference. The different classes of dictionary elements still separate according to usage (see Fig. S4 for the full dictionary), although the notable rises and plateaus seen in Fig. 3g are less apparent in this larger dictionary.
Figure 5. Our overcomplete, spectrogram-trained model exhibits a spectrotemporal tradeoff similar to that of the Inferior Colliculus.
Modulation spectra of the half-complete cochleogram-trained dictionary and the four-times overcomplete spectrogram-trained dictionary are shown. The four-times overcomplete spectrogram-trained dictionary elements (red dots; same dictionary as in Fig. 4) display a clear tradeoff between spectral and temporal modulations, similar to what has been reported for Inferior Colliculus (IC). By contrast, the half-complete cochleogram-trained dictionary (blue circles; same dictionary as in Fig. 2) exhibits a much more limited range of temporal modulations, with no such tradeoff in spectrotemporal resolution. Each data point represents the centroid of the modulation spectrum of the corresponding element. The elements shown in Fig. 4 are indicated on the graph with the same symbols as before.
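One way to compute the centroid of an element's modulation spectrum is sketched below. The page does not give the authors' procedure, so the details here are assumptions: the modulation spectrum is taken to be the power of the 2D discrete Fourier transform of the STRF, the centroid is the power-weighted mean of the absolute temporal modulation (Hz) and spectral modulation (cycles per octave), and the function names and toy patch are hypothetical. The DFT is written out naively so that the sketch needs only the standard library; it is practical only for small patches.

```python
import cmath
import math


def dft2_power(patch):
    """Power of the 2D DFT of a (frequency x time) patch, computed naively."""
    n_f, n_t = len(patch), len(patch[0])
    power = [[0.0] * n_t for _ in range(n_f)]
    for u in range(n_f):
        for v in range(n_t):
            acc = 0j
            for f in range(n_f):
                for t in range(n_t):
                    phase = -2j * math.pi * (u * f / n_f + v * t / n_t)
                    acc += patch[f][t] * cmath.exp(phase)
            power[u][v] = abs(acc) ** 2
    return power


def signed_index(k, n):
    """Map DFT bin k of an n-point transform to its signed frequency index."""
    return k if k <= n // 2 else k - n


def modulation_centroid(patch, duration_s, span_octaves):
    """Power-weighted mean |temporal| (Hz) and |spectral| (cycles/octave) modulation."""
    power = dft2_power(patch)
    n_f, n_t = len(patch), len(patch[0])
    total = sum(sum(row) for row in power)
    if total == 0:
        return 0.0, 0.0
    temporal = 0.0
    spectral = 0.0
    for u in range(n_f):
        for v in range(n_t):
            weight = power[u][v] / total
            spectral += weight * abs(signed_index(u, n_f)) / span_octaves
            temporal += weight * abs(signed_index(v, n_t)) / duration_s
    return temporal, spectral


# Toy patch: 4 frequency channels x 8 time frames containing a pure temporal
# modulation of 2 cycles across the patch and no spectral modulation.
patch = [[math.cos(2 * math.pi * 2 * t / 8) for t in range(8)] for _ in range(4)]
temporal_hz, spectral_cyc_per_oct = modulation_centroid(patch, duration_s=0.2, span_octaves=5.0)
# 2 cycles in 0.2 s gives a temporal centroid of about 10 Hz and a spectral centroid near 0.
print(temporal_hz, spectral_cyc_per_oct)
```

Under these assumptions, each dictionary element contributes one (temporal, spectral) point, and a tradeoff would appear as elements with high temporal modulation having low spectral modulation and vice versa.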
Figure 6. Model comparisons to receptive fields from auditory midbrain.
Complete and overcomplete sparse coding models trained on spectrograms of speech predict Inferior Colliculus (IC) spectro-temporal receptive field (STRF) shapes with excitatory and suppressive subfields that are localized in frequency but separated in time. (a) Two examples of Gerbil IC neural STRFs exhibiting ON-type response patterns with excitation following suppression; data courtesy of N.A. Lesica. (b) Representative model dictionary elements from each of three dictionaries that match this pattern of excitation and suppression. The three dictionaries were all trained on spectrogram representations of speech, using a hard sparseness (L0) penalty; the representations were complete (left column; Fig. S2), two-times overcomplete (middle column; Fig. S3), and four-times overcomplete (right column; Fig. 4 and Fig. S4). (c) Two example neuronal STRFs from cat IC exhibiting OFF-type patterns with excitation preceding suppression; data courtesy of M.A. Escabí. (d) Other model neurons from the same set of three dictionaries as in panel b also exhibit this OFF-type pattern.
Figure 7. Model comparisons to receptive fields from auditory midbrain and thalamus.
An overcomplete sparse coding model trained on spectrograms of speech predicts Inferior Colliculus (IC) and auditory thalamus (ventral division of the medial geniculate body; MGBv) spectro-temporal receptive fields (STRFs) consisting of localized checkerboard patterns containing roughly four to nine distinct subfields. (a) Example STRFs of localized checkerboard patterns from two Gerbil IC neurons, one cat IC neuron, and one cat MGBv neuron (top to bottom). Data courtesy of N.A. Lesica (top two cells) and M.A. Escabí (bottom two cells). (b) Elements from the four-times overcomplete, L0-sparse, spectrogram-trained dictionary with checkerboard patterns similar to those of the neurons in panel a.
Figure 8. Model comparisons to receptive fields from auditory midbrain and cortex.
An overcomplete sparse coding model trained on spectrograms of speech predicts several classes of broadband spectro-temporal receptive field (STRF) shapes found in Inferior Colliculus (IC) and primary auditory cortex (A1). (a,b) An example broadband OFF-type STRF from cat IC (top; data courtesy of M.A. Escabí) and an example broadband ON-type subthreshold STRF from rat A1 (bottom; data courtesy of M. Wehr) shown in panel a resemble example elements from a four-times overcomplete, L0-sparse, spectrogram-trained dictionary shown in panel b. (c) STRFs from a bat IC neuron (top; data courtesy of S. Andoni) and a cat A1 neuron (bottom; data courtesy of M.A. Escabí) each consist of a primary excitatory subfield that is modulated in frequency over time, flanked by similarly angled suppressive subfields. (d) Example STRFs from four elements taken from the same dictionary as in panel b exhibit patterns similar to the neuronal STRFs in panel c.

