PLoS Comput Biol. 2025 Jul 28;21(7):e1013244. doi: 10.1371/journal.pcbi.1013244. eCollection 2025 Jul.

Recurrent neural networks as neuro-computational models of human speech recognition

Christian Brodbeck et al. PLoS Comput Biol. 2025.

Abstract

Human speech recognition transforms a continuous acoustic signal into categorical linguistic units, by aggregating information that is distributed in time. It has been suggested that this kind of information processing may be understood through the computations of a Recurrent Neural Network (RNN) that receives input frame by frame, linearly in time, but builds an incremental representation of this input through a continually evolving internal state. While RNNs can simulate several key behavioral observations about human speech and language processing, it is unknown whether RNNs also develop computational dynamics that resemble human neural speech processing. Here we show that the internal dynamics of long short-term memory (LSTM) RNNs, trained to recognize speech from auditory spectrograms, predict human neural population responses to the same stimuli, beyond predictions from auditory features. Variations in the RNN architecture motivated by cognitive principles further improved this predictive power. Specifically, modifications that allow more human-like phonetic competition also led to more human-like temporal dynamics. Overall, our results suggest that RNNs provide plausible computational models of the cortical processes supporting human speech recognition.
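The abstract's core idea is that an LSTM receives the acoustic input frame by frame and accumulates distributed evidence in a continually evolving internal state. A minimal NumPy sketch of a single generic LSTM step makes this concrete; the dimensions (n_freq, n_hidden) and weights are hypothetical stand-ins, not the paper's trained architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: x is the current spectrogram frame,
    (h, c) the evolving internal state that accumulates evidence."""
    n = h.shape[0]
    z = W @ x + U @ h + b          # all four gate pre-activations at once
    i = sigmoid(z[:n])             # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:])         # candidate cell update
    c_new = f * c + i * g          # incrementally updated memory
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_freq, n_hidden = 16, 8           # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_freq))
U = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for t in range(50):                # feed 50 spectrogram frames in order
    frame = rng.normal(size=n_freq)
    h, c = lstm_step(frame, h, c, W, U, b)
```

The key property for speech recognition is that the state (h, c) at frame t depends on all earlier frames, so word identity can be resolved gradually as disambiguating input arrives.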


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. General design: predicting brain activity from a recurrent neural network (RNN).
(A) Human participants listened to spoken words, while magnetic fields were measured with magnetoencephalography (MEG). (B) A recurrent neural network (RNN), trained to recognize words in arbitrary sequences of words and silence, processed the same stimulus sequence that each human participant heard. (C) The RNN was trained to output the word that is currently being heard, for the whole duration of the word (“Training signal”; in the localist output shown, each color represents a different word in the output space; all outputs are set to 0, except for the word that is currently in the input, i.e., each colored box represents the output corresponding to one word being set to 1 for the duration of that word). In practice, this is impossible in the early time course, because information about word identity is distributed over time in the acoustic input. Instead, the RNN tends to activate several possible candidates before settling on the right word (“Model output”; words are sorted phonetically, thus words with similar color have a similar onset). (D) RNN activity over time was quantified as the sum of hidden unit magnitude and the sum of hidden unit magnitude increases. (E) These two signals were then used to predict the source-localized MEG responses from each participant, while controlling for the predictive power of acoustic features (a gammatone spectrogram, an acoustic onset spectrogram, and word onsets). Brain responses were predicted through multivariate temporal response function (mTRF) models.
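Panel D's two predictors (sum of hidden unit magnitude, and sum of hidden unit magnitude increases) can be sketched from a hidden-state matrix. The caption does not give the exact definition used in the paper; this sketch assumes magnitude means absolute activation and that only positive frame-to-frame changes count as increases:

```python
import numpy as np

def rnn_predictors(H):
    """H: hidden states, shape (time, units).
    Returns the two per-time-point signals from Fig 1D:
    total unit magnitude, and total magnitude increase."""
    mag = np.abs(H)
    total = mag.sum(axis=1)
    # frame-to-frame change; prepend the first frame so the first diff is 0
    diff = np.diff(mag, axis=0, prepend=mag[:1])
    increase = np.clip(diff, 0, None).sum(axis=1)  # positive changes only
    return total, increase

rng = np.random.default_rng(1)
H = rng.normal(size=(100, 64))     # hypothetical: 100 frames, 64 units
total, increase = rnn_predictors(H)
```

In the paper's pipeline, signals like these would then enter an mTRF model alongside the acoustic baseline predictors.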
Fig 2. Models for sparse output spaces better predict neural responses, and learn phonetic structure of the input lexicon.
(A) Architecture for RNNs with a semantic output space. Word targets are dense vectors in a semantic vector space (GloVe). Each word has non-zero values in most or all output elements (blue lines). Output is evaluated using mean squared error (MSE). (B) RNN with sparse output, where targets are binary vectors, and each word is defined by 1 element (localist; shown) or 10 elements (sparse random vectors). In the illustration using a localist output space, each word leads to activation dominated by a single output element (different elements are distinguished by color). (C) Word error rate as a function of the number of hidden units, showing that all models can learn the task given enough hidden units. Error bars indicate within-subject standard error [34], treating each speaker as a subject (i.e., indicating the estimated standard error of the mean word error rate by speaker, calculated separately for the training and test sets). (D) Predictive power of each model for brain activity (quantified as % improvement in explained variance over an acoustic baseline model in the region of interest shaded in blue). All RNN models significantly add predictive power to the auditory-only model, but sparser RNN models have consistently higher predictive power than semantic vector space models. This difference seems to diminish with an increasing number of hidden units. Error bars represent the within-subject standard error of the mean [34]. (E) Words commonly share the same acoustic-phonetic beginnings and can only be identified by considering all information across time. The plot illustrates this for conserve (first line): words lower in the graph share fewer word-initial phonemes. The point at which each word starts to differ from conserve is marked with a red line. (F) This acoustic-phonetic structure is reflected in the learned output mappings of localist models (dense layers): words that share more onset phonemes are also located closer together in the output mappings. The graph shows the pairwise distance between words (x-axis) as a function of how many onset phonemes they share (y-axis). (G) Analogous analysis for the GloVe output space. The increased proximity of words sharing 4+ phonemes may be due to morphological structure (e.g., story and stories share onset phonemes because they have the same morphological root).
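The panel F/G analysis relates pairwise distance in the learned output mappings to the number of shared word-initial phonemes. A minimal sketch of that computation follows; the phonetic transcriptions and the random output vectors are hypothetical stand-ins for the model's learned mappings:

```python
import numpy as np

def shared_onset(a, b):
    """Number of word-initial phonemes two transcriptions share."""
    n = 0
    for p, q in zip(a, b):
        if p != q:
            break
        n += 1
    return n

# Hypothetical transcriptions (illustrative, not the paper's lexicon)
words = {
    "conserve": ["k", "ah", "n", "s", "er", "v"],
    "concern":  ["k", "ah", "n", "s", "er", "n"],
    "combine":  ["k", "ah", "m", "b", "ay", "n"],
}
rng = np.random.default_rng(2)
vecs = {w: rng.normal(size=16) for w in words}  # stand-in output mappings

# Pairwise distance vs. number of shared onset phonemes (Fig 2F axes)
pairs = []
names = list(words)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        shared = shared_onset(words[a], words[b])
        dist = np.linalg.norm(vecs[a] - vecs[b])
        pairs.append((a, b, shared, dist))
```

With the model's real output weights in place of the random vectors, the prediction from the caption is that distance decreases as the shared-onset count grows.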
Fig 3. Depth and unit sub-grouping improve performance and neural prediction.
(A) Depth was added by stacking RNNs while controlling the number of trainable parameters (the number of hidden units per layer is indicated in white). (B) Deeper models have a lower word error rate. (C) Deeper models also achieved better predictive power for brain responses when units in each layer were used to create separate predictors (solid lines). However, increasing the number of predictors derived from flat models by K-means clustering also increased predictive power (dotted lines), with the highest predictive power for the flat localist model at K = 32. (D) K-means clustering applied to deep models (ignoring the differentiation of units into different layers). While depth generally improved the predictive power of GloVe models, it did not seem to have a clear effect on localist models. (E) Acoustic-phonetic structure in the learned output mappings is found even in deep localist models, but it decreases as a function of depth (analogous to Fig 2F).
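Panels C and D derive additional predictors by K-means clustering of hidden units. A minimal sketch, assuming units are clustered by their absolute activation time courses and each cluster contributes one summed-magnitude predictor; the caption does not specify the paper's exact clustering feature space, so this is an illustration:

```python
import numpy as np

def kmeans_units(H, k, iters=20, seed=0):
    """Cluster hidden units by their activation time courses and
    return one summed-magnitude predictor per cluster.
    H: (time, units); returns (time, k)."""
    rng = np.random.default_rng(seed)
    X = np.abs(H).T                     # one row per unit
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each unit to its nearest cluster center
        labels = np.argmin(
            ((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned units
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    # one predictor per cluster: summed unit magnitude over time
    return np.stack([X[labels == j].sum(0) for j in range(k)], axis=1)

rng = np.random.default_rng(3)
H = rng.normal(size=(200, 64))          # hypothetical hidden states
P = kmeans_units(H, k=8)                # (time, 8) predictor matrix
```

Each column of P can then serve as one mTRF predictor, so K controls the granularity of the RNN-derived feature set, matching the K sweep in the figure.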
Fig 4. Modified loss function that enables phonetic competition improves performance and neural prediction.
(A) Left panel: Activation of cohort competitors in response to human-speaker words is limited (shown for the 1-layer localist model). The plot shows model output values in the localist space, interpreted as lexical “activation”, as a function of time since word onset. Shown is the output for the target (yellow), and the average for non-target words, with color indicating how many word-initial phonemes the respective non-target shares with the target. Only data for the human-speaker tokens are shown (including 2751 trained and 183 untrained tokens). Values range from 0 to 1 due to the sigmoid activation function. Right panel: The relative activation of the correct target, as a function of how many word-initial phonemes the target shares with at least one other word (indicated by color). Relative activation is calculated as the activation of the target divided by the sum of the total activation in the output space at each time point. Because the activation of the target is divided by the total activation, this can be read as a probability estimate for the target. Words that share more phonemes with competitors should take longer to reach a high probability. (B) Theoretical prediction for target choice (A, right panel), based on phonetic transcriptions of the human-speaker tokens. Target probability for each word was defined as 1 divided by the number of words in the cohort based on shared phonemes. (C) Same as A, but for synthetic talkers, only including tokens used in model training. Responses exhibit somewhat increased cohort competition. (D) A subset of 1000 human-speaker target tokens that were never presented during training (in a separately trained model; data for the same 1000 tokens that were used in the MEG experiment). The upper panels show data for all trials. While target activation is delayed compared to A, lexical activation is generally low. The lower panel shows only correct trials (26.8%). High relative target activation suggests that the model pursued a similar strategy as for trained human tokens, activating a single candidate. (E) Analogous to B, but for the subset of 1000 words used in D. (F) Responses to synthetic tokens not used during training exhibit more typical competition effects than the responses to human tokens. This may indicate that synthetic tokens are acoustically more homogeneous. (G) The loss function was modified to reduce the penalty incurred by activating non-target words early during word presentation. Plots show responses to all human-speaker tokens (including previously seen tokens, as in A). The modified loss function led to more human-like activation of cohort competitors early during human-speaker words (left column), and delayed the point at which activation of the target exceeded other words, with the expected gradation by how many onset phonemes competitors share with the target (right column), more consistent with the acoustic ambiguity of the word onsets (data from 1-layer models). (H) The modified loss function led to improved word error rates (WERs), and (I) substantially increased predictive power for human brain activity. (J) Lexical activation in a deep model was comparable to that of the 1-layer model. (K) The new loss function led to more realistic lexical activation of untrained human-speaker tokens (same procedure as for D). (L) Acoustic-phonetic structure in the output mappings (c = 1024, analogous to Fig 2F).
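Panel G's modification is described only as reducing the penalty for activating non-target words early during word presentation. One hypothetical way to implement such a loss is to ramp the non-target binary cross-entropy weight from 0 at word onset to 1 at word offset; this sketch illustrates that idea and is not the paper's actual loss function:

```python
import numpy as np

def ramped_bce(output, target_idx, eps=1e-9):
    """Binary cross-entropy over a localist output, with the penalty
    for activating non-target words ramped up linearly over the word.
    A hypothetical form of the Fig 4G modification.
    output: (time, words) sigmoid activations in [0, 1]."""
    T, W = output.shape
    ramp = np.linspace(0.0, 1.0, T)[:, None]   # 0 early -> 1 late
    weights = np.ones((T, W)) * ramp           # non-target weight grows
    weights[:, target_idx] = 1.0               # target always penalized
    y = np.zeros(W)
    y[target_idx] = 1.0
    bce = -(y * np.log(output + eps)
            + (1 - y) * np.log(1 - output + eps))
    return (weights * bce).mean()

# A competitor active early in the word is penalized less than one
# active late, so early cohort competition is tolerated.
early = np.full((10, 3), 0.05)
early[:5, 1] = 0.9                             # competitor active early
late = np.full((10, 3), 0.05)
late[5:, 1] = 0.9                              # competitor active late
loss_early = ramped_bce(early, target_idx=0)
loss_late = ramped_bce(late, target_idx=0)
```

Under this kind of weighting, the network is free to keep several cohort candidates active at word onset, which matches the more human-like competition dynamics the caption reports.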

References

    1. Aertsen AMHJ, Johannesma PIM, Hermes DJ. Spectro-temporal receptive fields of auditory neurons in the grassfrog: II. Analysis of the stimulus-event relation for tonal stimuli. Biol Cybern. 1980;38(4):235–48.
    2. Escabí MA, Schreiner CE. Nonlinear spectrotemporal sound analysis by neurons in the auditory midbrain. J Neurosci. 2002;22(10):4114–31.
    3. Fishbach A, Nelken I, Yeshurun Y. Auditory edge detection: a neural model for physiological and psychoacoustical responses to amplitude transients. J Neurophysiol. 2001;85(6):2303–23. doi: 10.1152/jn.2001.85.6.2303
    4. Singer Y, Teramoto Y, Willmore BD, Schnupp JW, King AJ, Harper NS. Sensory cortex is optimized for prediction of future input. eLife. 2018;7. Available from: https://elifesciences.org/articles/31557
    5. Williamson RS, Ahrens MB, Linden JF, Sahani M. Input-specific gain modulation by local sensory context shapes cortical and thalamic responses to complex sounds. Neuron. 2016;91(2):467–81. doi: 10.1016/j.neuron.2016.05.041
