PLoS Comput Biol. 2013;9(9):e1003219. doi: 10.1371/journal.pcbi.1003219. Epub 2013 Sep 12.

From birdsong to human speech recognition: Bayesian inference on a hierarchy of nonlinear dynamical systems


Izzet B Yildiz et al. PLoS Comput Biol. 2013.

Abstract

Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that experimental findings at the neuronal, microscopic level are scarce. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, the songbird, which faces the same challenge as humans: to learn and decode complex auditory input in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel model of human sound learning and recognition, with an emphasis on speech. We show that the resulting Bayesian model, with its hierarchy of nonlinear dynamical systems, can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents, an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and to derive predictions for future experiments.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Summary of the hierarchical model of speech learning and recognition.
The core of the model is equivalent to the core of the birdsong model. Equations 1 and 2 on the right side generate the dynamics shown on the left side and are described in the Model section (see also Table 1 for the meaning of the parameters). Speech sounds, i.e., sound waves, enter the model through the cochlear level. The output is a cochleagram (shown for the speech stimulus “zero”), a type of frequency-time diagram. There are 86 channels, each representing the firing rate of a neuronal ensemble (warm colors for high and cold colors for low firing rates), with higher channel numbers encoding lower frequencies. We reduce this input to six dimensions by averaging blocks of 14 channels (see the color coding to the right of the cochleagram and also see Model). After this cochlear processing, activity is fed forward into the two-level hierarchical model. The input is encoded by the activity of the first level network (shown with the same color coding on the right), which is in turn encoded by activity at the second level (no color coding at this level; different colors represent different neuronal ensembles). From the generative model shown here (the core model), we derived a recognition model (for mathematical details, see Model).
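To make the channel-averaging step concrete, here is a minimal sketch (not the authors' code) of reducing an 86-channel cochleagram to six dimensions by averaging blocks of 14 adjacent channels. How the two leftover channels are handled (86 = 6 × 14 + 2) is an assumption here; they are simply dropped.

```python
import numpy as np

def reduce_cochleagram(cochleagram: np.ndarray, block: int = 14) -> np.ndarray:
    """Average adjacent frequency channels of a (n_channels, n_timesteps) array."""
    n_channels, n_time = cochleagram.shape
    n_blocks = n_channels // block             # 86 // 14 == 6
    trimmed = cochleagram[: n_blocks * block]  # drop leftover channels (assumption)
    return trimmed.reshape(n_blocks, block, n_time).mean(axis=1)

# Random stand-in for the cochleagram of "zero": 86 channels, 400 time steps
x = np.random.rand(86, 400)
print(reduce_cochleagram(x).shape)  # (6, 400)
```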
Figure 2. Schematic structure of an agent and a module.
A) An agent consists of several modules, where each module contains an instance of the model shown in Figure 1 and has learned to recognize a single word. Sensory input is recognized by all modules concurrently, and each module experiences a prediction error during recognition. A module can be considered a sophisticated, dynamic, Bayes-optimal template matcher that produces less prediction error the better the stimulus matches the module's learned word. A minimum operator performs classification by selecting the module with the least prediction error during recognition. B) At each level in a module, causal and hidden states ($v$ and $x$, respectively) minimize the precision-weighted prediction errors ($\varepsilon_v$ and $\varepsilon_x$) by exchanging messages. Predictions are passed from the second level to the first, and prediction errors are propagated back from the first to the second level (see section Model: Learning and Recognition for more details).
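The minimum operator in panel A amounts to an argmin over module prediction errors. Below is a minimal sketch of that rule, assuming a hypothetical interface in which each trained module exposes a function returning its total prediction error for a stimulus; the module internals (the two-level generative model and its inversion) are not implemented here.

```python
from typing import Callable, Mapping
import numpy as np

def classify(stimulus: np.ndarray,
             modules: Mapping[str, Callable[[np.ndarray], float]]) -> str:
    """Run every module on the stimulus and return the label of the module
    that produced the least total prediction error (the minimum operator)."""
    errors = {word: prediction_error(stimulus)
              for word, prediction_error in modules.items()}
    return min(errors, key=errors.get)

# Hypothetical usage with ten digit modules M0..M9:
# modules = {"zero": m0.total_prediction_error, ..., "nine": m9.total_prediction_error}
# recognized = classify(cochleagram, modules)
```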
Figure 3. Schema of ideal precision settings, at the first and second levels of a module, for learning and recognition under noise.
The precision of a population at each level is indicated by the thickness of the line around its symbol, and the influence of one population over another is indicated by arrow strength. A) During learning, the precision ratio at the first level (precision of the sensory states, i.e., causal states, over the precision of the internal (hidden) dynamics) should be high. Consequently, the internal dynamics at the first level are dominated by the dynamics of the sensory input. At the second level, a very high precision ensures that the module is forced to explain the sensory input as sequential dynamics by updating (learning) the connections between the first and second levels (the I's in the first line of Equation 2). B) Under noisy conditions, the sensory input is not reliable, and recognition performance is best if the precision at the sensory level is low compared to the precision of the internal dynamics at both levels (low sensory/internal precision ratio). This allows the module to rely on its (previously learned) internal dynamics and less on the noisy sensory input. For the exact values of the precision settings in each scenario, see Text S1.
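The two regimes can be summarized as configuration sketches. The numeric log-precision values below are hypothetical placeholders (the actual values are given in Text S1); only the ordering of the sensory/internal ratio matters for the argument.

```python
# Panel A: learning with clean input -- high sensory/internal precision ratio.
LEARNING = {
    "level1_sensory_logprec": 8,   # trust the sensory (causal) states
    "level1_internal_logprec": 2,  # let input dominate the internal dynamics
    "level2_logprec": 12,          # force a sequential explanation (drives learning)
}

# Panel B: recognition in noise -- low sensory/internal precision ratio.
RECOGNITION_IN_NOISE = {
    "level1_sensory_logprec": 2,   # distrust the noisy input
    "level1_internal_logprec": 8,  # rely on previously learned dynamics
    "level2_logprec": 8,
}
```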
Figure 4. Generated neuronal network activity at the first level after learning.
The solid lines represent the cochleagram dynamics obtained from the stimulus that the module had to learn (the word “zero”, the same stimulus as shown in Figure 1). Neuronal activity was normalized to one. The dashed lines represent the neuronal activity generated by the module after learning and show that the module has successfully learned the proper I vectors between the two levels.
Figure 5. Invariance of the recognition model to variation in speech rate.
A) The normal-length stimulus “eight” (400 ms, top panel) was learned and recognized successfully by the module “eight” (M8). For clarity, we only show the second level causal states (see Model). The same module (without any parameter adaptation) successfully recognizes a time-compressed version of the same stimulus (300 ms, middle panel). For comparison, the module trained on the digit “three” (M3) fails to reconstruct its expected dynamics when exposed to “eight” (bottom panel). B) The total prediction errors produced at the second level hidden states by ten different modules (M0 to M9), which were previously trained on the corresponding digits at normal length, are shown. All modules were exposed to the same 25% time-compressed “eight” stimulus. Module M8 (red arrow) produces the lowest prediction error, showing that prediction error can be used for classification even when the stimulus is time-compressed.
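The time compression used for the test can be illustrated with a resampling sketch. This is an assumption about the preprocessing: here the cochleagram's time axis is linearly interpolated down to 75% of its length (400 ms to 300 ms), whereas the original stimuli may have been compressed in the waveform domain instead.

```python
import numpy as np

def compress_time(cochleagram: np.ndarray, factor: float = 0.75) -> np.ndarray:
    """Shrink the time axis of a (n_channels, n_timesteps) array to `factor`
    of its original length via linear interpolation."""
    n_ch, n_t = cochleagram.shape
    new_t = int(round(n_t * factor))
    old_positions = np.linspace(0, n_t - 1, new_t)
    return np.stack([np.interp(old_positions, np.arange(n_t), channel)
                     for channel in cochleagram])

# 400 time steps -> 300 time steps (25% compression)
print(compress_time(np.random.rand(6, 400)).shape)  # (6, 300)
```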
Figure 6. Performance of the recognition model in “cocktail party” situations.
A module is trained on a spoken sentence (“She argues with her sister”) without competing speakers and tested for recognition of this sentence in three conditions: no competing speaker (left column), one competing speaker (middle column), and three competing speakers (right column). Each column shows the second level dynamics, the first level dynamics, and the cochleagram, in arbitrary units of neuronal activation. Second level dynamics were successfully reconstructed for the single speaker and also, to an extent, for the speech sample with one competing speaker. In the case of three competing speakers, the module was not able to reconstruct the second level dynamics completely, but showed some signs of recovery at the beginning and at the end of the sentence. Note that the increasing difficulty in reconstructing the speech message from one to three speakers is not reflected in the prediction errors at the first level (dashed lines), but becomes obvious at the second level.
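A sketch of how such test conditions could be constructed, assuming the cocktail-party stimuli are simple additive mixtures of single-speaker waveforms at a common sampling rate (the paper's exact mixing procedure is not specified here).

```python
import numpy as np

def mix_speakers(target: np.ndarray, competitors: list[np.ndarray]) -> np.ndarray:
    """Add zero, one, or three competing-speaker waveforms to the target sentence."""
    mixture = target.astype(float).copy()
    for speech in competitors:
        n = min(len(mixture), len(speech))  # truncate to the shorter signal
        mixture[:n] += speech[:n]
    return mixture

# Hypothetical usage for the three-competitor condition:
# mixture = mix_speakers(sentence, [speaker_b, speaker_c, speaker_d])
```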
Figure 7. Accent adaptation of the recognition model.
A) The cochleagrams represent two utterances of “eight”. A module originally learned the word “eight” spoken with a British (North England) accent (top) and then recognized an “eight” spoken with a New Zealand accent (bottom). B) The module trained on the British accent was allowed to adapt to the New Zealand accent under different precision settings for the first level sensory (causal) and internal (hidden) states, with the sensory/internal log-precision ratio varied from left to right. For each precision ratio, we plotted the reduction in prediction error (of the causal states, see Model) after five repetitions of the word “eight” spoken with a New Zealand accent. As expected, accent adaptation was accomplished only with high sensory/internal precision ratios (resulting in greatly reduced prediction errors), whereas no adaptation occurred (prediction errors remained high) when this ratio was low.
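The precision sweep in panel B can be outlined as follows. Everything about the module interface here (copy, set_log_precisions, fit) is hypothetical; the sketch only captures the logic of re-exposing an adapting module five times per precision setting and recording the drop in causal prediction error.

```python
import numpy as np

def adaptation_curve(module, stimulus, sensory_logprecs, repetitions=5):
    """For each sensory log-precision (internal held fixed, so the ratio varies),
    expose a fresh copy of the module to the new-accent stimulus and record the
    reduction in causal prediction error from the first to the last repetition."""
    reductions = []
    for logprec in sensory_logprecs:
        m = module.copy()                                        # hypothetical API
        m.set_log_precisions(sensory=logprec, internal=0.0)      # hypothetical API
        errors = [m.fit(stimulus) for _ in range(repetitions)]   # hypothetical API
        reductions.append(errors[0] - errors[-1])
    return np.array(reductions)
```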
Figure 8. Qualitative modeling of experimental results in second language learning.
A) Behavioral results of an experiment on the recognition of English words by three groups of native Italian speakers who differed in their age of arrival in Canada (early, mid, and late arrival groups), compared with a native English (NE) speaker group. Participants were asked to repeat as many words as possible after hearing an English sentence. Sentences were presented at different signal-to-noise ratios, given in decibels (dB). B) Results of the learning and recognition simulations, in which we used the same speech samples as in the Word Recognition Task. The different ages of arrival were modeled with different precision ratios at the first level. Recognition accuracy is measured in terms of the normalized, total causal prediction error during recognition relative to a baseline condition of −30 dB noise, i.e., recognition accuracy = 100 × (baseline prediction error − test prediction error) / baseline prediction error. Note that we used different signal-to-noise ratios than the original experiment because the best recognition results with our model were obtained at 30 dB, which corresponds to the almost ideal recognition performance of humans around 12 dB, and we scaled the remaining ratios accordingly. Each symbol represents the average recognition accuracy obtained from 10 digits, where the stimulus was masked with noise at the given signal-to-noise ratio.
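The accuracy normalization from the caption, written out as a function (a direct transcription of the formula, with scalar prediction errors):

```python
def recognition_accuracy(baseline_error: float, test_error: float) -> float:
    """accuracy = 100 * (baseline_error - test_error) / baseline_error,
    where baseline_error is the prediction error in the -30 dB condition."""
    return 100.0 * (baseline_error - test_error) / baseline_error

print(recognition_accuracy(10.0, 2.5))  # 75.0
```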
