Machine translation of cortical activity to text with an encoder-decoder framework

Joseph G Makin et al. Nat Neurosci. 2020 Apr;23(4):575-582. doi: 10.1038/s41593-020-0608-8. Epub 2020 Mar 30.

Abstract

A decade after speech was first decoded from human brain signals, accuracy and speed remain far below that of natural speech. Here we show how to decode the electrocorticogram with high accuracy and at natural-speech rates. Taking a cue from recent advances in machine translation, we train a recurrent neural network to encode each sentence-length sequence of neural activity into an abstract representation, and then to decode this representation, word by word, into an English sentence. For each participant, data consist of several spoken repeats of a set of 30-50 sentences, along with the contemporaneous signals from ~250 electrodes distributed over peri-Sylvian cortices. Average word error rates across a held-out repeat set are as low as 3%. Finally, we show how decoding with limited data can be improved with transfer learning, by training certain layers of the network under multiple participants' data.
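
The word-by-word decoding described above can be pictured as a greedy autoregressive loop: the encoder summarizes a sentence-length sequence of neural features into a hidden state, and the decoder then emits one word at a time, feeding each prediction back in until an end-of-sequence token. A minimal sketch, assuming PyTorch; model.encode and model.decode_step are hypothetical stand-ins for a trained network, not the authors' API.

```python
import torch

def decode_sentence(model, features, eos_id: int, max_words: int = 20):
    """features: (1, steps, n_features) high-gamma feature sequence for one sentence."""
    h = model.encode(features)                    # hypothetical: final encoder hidden state
    words, prev = [], torch.tensor([[eos_id]])    # the EOS token also serves as the start token
    for _ in range(max_words):
        logits, h = model.decode_step(prev, h)    # hypothetical: one decoder step
        prev = logits.argmax(dim=-1)              # greedy choice of the next word
        if prev.item() == eos_id:
            break
        words.append(prev.item())
    return words
```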

Conflict of interest statement

This work was funded in part by Facebook Reality Labs. UCSF holds patents related to speech decoding.

Figures

Fig. 1 | The decoding pipeline.
Each participant read sentences from one of two datasets (MOCHA-TIMIT, picture descriptions) while neural signals were recorded with an ECoG array (120–250 electrodes) covering peri-Sylvian cortices. The analytic amplitudes of the high-γ signals (70–150 Hz) were extracted at about 200 Hz, clipped to the length of the spoken sentences and supplied as input to an artificial neural network. The early stages of the network learn temporal convolutional filters that, additionally, effectively downsample these signals. Each filter maps data from 12-sample-wide windows across all electrodes (for example, the green window shown on the example high-γ signals in red) to single samples of a feature sequence (highlighted in the green square on the blue feature sequences); then slides by 12 input samples to produce the next sample of the feature sequence; and so on. One hundred feature sequences are produced in this way, and then passed to the encoder RNN, which learns to summarize them in a single hidden state. The encoder RNN is also trained to predict the MFCCs of the speech audio signal that temporally coincide with the ECoG data, although these are not used during testing (see “The decoding pipeline” for details). The final encoder hidden state initializes the decoder RNN, which learns to predict the next word in the sequence, given the previous word and its own current state. During testing, the previously predicted word is used instead.
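
The 12-sample-wide, stride-12 filters described above amount to a strided 1-D convolution over the electrode channels. A minimal sketch, assuming PyTorch; the electrode count and sentence length are illustrative.

```python
import torch
import torch.nn as nn

n_electrodes = 250   # per-participant counts vary (~120-250)
n_features   = 100   # feature sequences produced by the convolution
width        = 12    # 12-sample-wide window, slid by 12 samples (no overlap)

# Strided temporal convolution: each filter spans all electrodes and downsamples by 12.
temporal_conv = nn.Conv1d(n_electrodes, n_features, kernel_size=width, stride=width)

# high_gamma: (batch, electrodes, time) analytic amplitudes sampled at ~200 Hz
high_gamma = torch.randn(1, n_electrodes, 240)   # an illustrative ~1.2 s sentence
features = temporal_conv(high_gamma)
print(features.shape)                            # torch.Size([1, 100, 20])
```
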
Fig. 2 | WERs of the decoded sentences.
a, WERs for one participant under the encoder–decoder (first bar), four crippled variants thereof (bars 2–4 and 6) and a state-of-the-art sentence classifier based on ECoG-to-phoneme Viterbi decoding (phoneme-based HMM). No MFCCs, trained without requiring the encoder to predict MFCCs; low density, trained and tested on a simulated lower-density grid (8-mm rather than 4-mm spacing); no conv., the network’s temporal convolution layer is replaced with a fully connected layer; length info. only, the input ECoG sequences are replaced with Gaussian noise, but of the correct length. The box and whiskers show, respectively, the quartiles and the extent (excepting outliers, which are shown explicitly as black diamonds) of the distribution of WERs across n=30 networks trained independently from scratch and evaluated on randomly selected held-out blocks. Significance, indicated by asterisks (***P < 0.0005), was computed with a one-sided Wilcoxon signed-rank test and Holm–Bonferroni corrected for five comparisons. Exact P values appear in Supplementary Table 5. b, For four different participants, WER as a function of the number of repeats of the sentence sets used for training; that is, the number of training tokens for each sentence type. Results for MOCHA-1 (50 sentence types; see “Results” for details) are shown in solid lines (pink, green, brown); for the picture descriptions (30 sentence types), in dashed lines (blue, brown). Note that participant d (brown) read from both sets. The endpoint of the pink curve corresponds to the first bar of a. Whiskers indicate standard errors of the mean WERs (vertical) and mean number of repeats (horizontal) across n=10 networks trained independently from scratch and evaluated on randomly selected held-out blocks (the number of repeats varies because data were divided on the basis of blocks, which vary slightly in length).
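
WER is the word-level edit (Levenshtein) distance between the decoded and reference sentences, normalized by the reference length. A minimal sketch in Python; the example sentences are illustrative, not drawn from the datasets.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Illustrative example: one substitution in a six-word reference sentence.
print(wer("the cat sat on the mat", "the cat sat on a mat"))   # 0.1666...
```
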
Fig. 3 | WER of the decoded MOCHA-1 sentences for encoder–decoder models trained with transfer learning.
Each panel corresponds to a participant (color code as in Fig. 2). The four boxes in each panel show WER without transfer learning (‘encoder–decoder’, as in the final points in Fig. 2b), with cross-participant transfer learning (+participant TL), with training on sentences outside the test set (+task TL) and with both forms of transfer learning (+dual TL). The box and whiskers show, respectively, the quartiles and the extent (excepting outliers, which are shown explicitly as black diamonds) of the distribution of WERs across n=30 networks trained independently from scratch and evaluated on randomly selected held-out blocks. Significance, indicated by asterisks (*P < 0.05; **P < 0.005; ***P < 0.0005; NS, not significant), was computed with a one-sided Wilcoxon signed-rank test and Holm–Bonferroni corrected for 14 comparisons: the 12 shown here plus two others noted in the text. Exact P values appear in Supplementary Table 6. a, Participant a, with pretraining on participant b/pink (second and fourth bars). b, Participant b, with pretraining on participant a/green (second and fourth bars). c, Participant d, with pretraining on participant b/pink (second and fourth bars).
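
Cross-participant transfer learning can be sketched as pretraining on one participant and reusing every layer whose shape does not depend on that participant's electrode grid when training on the next. The snippet below, assuming PyTorch, uses a deliberately tiny stand-in model; it illustrates the weight-copying idea only, not the authors' exact training schedule.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in: participant-specific input convolution + shareable encoder RNN."""
    def __init__(self, n_electrodes: int):
        super().__init__()
        self.conv = nn.Conv1d(n_electrodes, 100, kernel_size=12, stride=12)  # electrode-dependent
        self.rnn = nn.GRU(100, 400, batch_first=True)                        # shareable
    def forward(self, x):                     # x: (batch, electrodes, time)
        feats = self.conv(x).transpose(1, 2)  # (batch, steps, 100)
        _, h = self.rnn(feats)
        return h                              # final hidden state

source = TinyEncoder(n_electrodes=256)   # pretrained on another participant's data
target = TinyEncoder(n_electrodes=128)   # the participant with limited data

# Copy every pretrained weight whose shape matches; the electrode-dependent
# convolution keeps its fresh initialization.
src_state, tgt_state = source.state_dict(), target.state_dict()
for name, weight in src_state.items():
    if name in tgt_state and tgt_state[name].shape == weight.shape:
        tgt_state[name] = weight
target.load_state_dict(tgt_state)
# ...then train `target` as usual on the new participant's ECoG/sentence pairs.
```
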
Fig. 4 | The contributions of each anatomical area to decoding, as measured by the gradient of the loss function with respect to the input data (see “Anatomical contributions” for details).
The contributions are broken down by participant, with the same color scheme as throughout (compare with Fig. 2). Each shaded area represents a kernel density estimate of the distribution of contributions of electrodes in a particular anatomical area; black dots indicate the raw contributions. The scale and ‘zero’ of these contributions were assumed to be incomparable across participants and, therefore, all data were rescaled to the same interval for each participant (smallest contribution at left, largest contribution at right). Missing densities (for example, temporal areas in participant c/blue) correspond to areas with no grid coverage. a.u., arbitrary units; IFG, inferior frontal gyrus.
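
The contribution measure is the gradient of the loss with respect to the input ECoG, collapsed to one number per electrode. A minimal sketch, assuming PyTorch; the aggregation used here (mean absolute gradient over time and examples) is one plausible choice and may differ in detail from the authors' computation.

```python
import torch

def electrode_contributions(model, ecog, targets, loss_fn):
    """ecog: (batch, electrodes, time). Returns one contribution score per electrode."""
    ecog = ecog.clone().requires_grad_(True)   # track gradients with respect to the input
    loss = loss_fn(model(ecog), targets)
    loss.backward()
    # Collapse the per-example, per-time-step gradients to a single score per electrode.
    return ecog.grad.abs().mean(dim=(0, 2))
```
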
Fig. 5 | Electrode coverage and contributions.
a–d, Anatomical reconstructions of the four participants (colored frames indicating participant identity according to the color scheme used throughout), with the location of the ECoG electrodes indicated with colored discs. For each disc, the area indicates the electrode’s contribution to decoding (see the Methods), and the color indicates the anatomical region (see key).
Fig. 6 | Graphical model for the decoding process.
Circles represent random variables; doubled circles are deterministic functions of their inputs. a, The true generative process (above) and the encoder–decoder model (below). The true relationship between neural activity (N), the speech-audio signal (A) and word sequences (W), denoted P(a, w | n), is unknown (although we have drawn the graph to suggest that W and A are independent given N). However, we can observe samples from all three variables, which we use to fit the conditional model, Q(a, w, sd(se), se(n) | n; Θ), which is implemented as a neural network. The model separates the encoder states, Se, which directly generate the audio sequences, from the decoder states, Sd, which generate the word sequences. During training, model parameters Θ are changed so as to make the model distribution, Q, over A and W look more similar to the true distribution, P. b, Detail of the graphical model for the decoder, unrolled vertically in sequence steps. Each decoder state is computed deterministically from its predecessor and the previously generated word or (in the case of the zeroth state) the final encoder state and an initialization token, EOS.
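
Fitting Q to P amounts, during training, to minimizing a joint objective: cross-entropy on the next word for the decoder, plus a penalty on the encoder's MFCC predictions, which serve only as an auxiliary target. A minimal sketch, assuming PyTorch; the squared-error MFCC term and its weighting are illustrative choices, not necessarily the authors' exact loss.

```python
import torch.nn.functional as F

def training_loss(word_logits, word_targets, mfcc_pred, mfcc_targets, mfcc_weight=1.0):
    # word_logits: (batch, words, vocab); word_targets: (batch, words) of word indices
    word_loss = F.cross_entropy(word_logits.transpose(1, 2), word_targets)
    # mfcc_pred / mfcc_targets: (batch, steps, n_mfcc), aligned with the encoder steps
    mfcc_loss = F.mse_loss(mfcc_pred, mfcc_targets)
    # The MFCC term shapes the encoder during training only; it is dropped at test time.
    return word_loss + mfcc_weight * mfcc_loss
```
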
Fig. 7 | Network architecture.
The encoder and decoder are shown unrolled in time, or more precisely in sequence elements (columns). Thus, all layers (boxes) within the same row of the encoder or of the decoder have the same incoming and outgoing weights. The arrows in both directions indicate a bidirectional RNN (see “Implementation: architecture”). Although the figure depicts the temporal convolutions as eight-sample-wide convolutions (due to space constraints), all results are from networks with 12-sample-wide convolutions. The end-of-sequence token is denoted EOS.
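
Putting the pieces together, the unrolled architecture is a sequence-to-sequence network: the convolutional feature sequences feed a bidirectional encoder RNN whose final hidden state initializes the decoder RNN, which is trained with the true previous word as input (teacher forcing) and fed its own predictions at test time. A minimal sketch, assuming PyTorch; the GRU cells, layer sizes and vocabulary size are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, n_features=100, hidden=400, vocab=2000):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.bridge = nn.Linear(2 * hidden, hidden)   # merge the two encoder directions
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, vocab)

    def forward(self, features, prev_words):
        # features: (batch, steps, n_features) from the temporal convolution;
        # prev_words: (batch, words), starting with the EOS/initialization token.
        _, h = self.encoder(features)                                   # h: (2, batch, hidden)
        h0 = torch.tanh(self.bridge(torch.cat([h[0], h[1]], dim=-1)))   # (batch, hidden)
        out, _ = self.decoder(self.embed(prev_words), h0.unsqueeze(0))  # teacher forcing
        return self.readout(out)                                        # (batch, words, vocab)
```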

Comment in

  • Translating the brain.
    Cogan GB. Nat Neurosci. 2020 Apr;23(4):471-472. doi: 10.1038/s41593-020-0616-8. PMID: 32231339. No abstract available.
